MF-Retarget: Aggregate Awareness in Multiple Fact
            Table Schema Data Warehouses

               Karin Becker, Duncan Dubugras Ruiz, and Kellyne Santos

       Faculdade de Informática – Pontifícia Universidade Católica do Rio Grande do Sul
                       http://www.inf.pucrs.br/{~kbecker | ~duncan}
             {kbecker, duncan} @inf.pucrs.br, kellyne@ufs.br



       Abstract. Performance is a critical issue in Data Warehouse systems (DWs),
       due to the large amounts of data manipulated, and the type of analysis
       performed. A common technique used to improve performance is the use of
       pre-computed aggregate data, but the use of aggregates must be transparent for
       DW users. In this work, we present MF-Retarget, a query retargeting
       mechanism that deals with both conventional star schemas and multiple fact
       table (MFT) schemas. This type of multidimensional schema is often used to
       implement a DW using distinct, but interrelated Data Marts. The paper presents
       the retargeting algorithm and initial performance tests.




1    Introduction

Data warehouses (DW) are analytical databases aimed at providing intuitive access to
information useful for decision-making processes. A Data Mart (DM), often referred
to as a subject-oriented DW, represents a subset of the DW, comprised of relevant
data for a particular business function (e.g. marketing, sales). DW/DM handle large
volumes of data, and they are often designed using a star schema, which contains
relatively few tables and well-defined join paths. On-line Analytical Processing
(OLAP) systems are the predominant front-end tools used in DW environments,
which typically explore this multidimensional data structure [3, 13]. OLAP operations
(e.g. drill down, roll up, slice and dice) typically result in SQL queries in which
aggregation functions (e.g. SUM, COUNT) are applied to fact table attributes, using
dimension table attributes as grouping columns (group by clause).
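To make this query pattern concrete, the following sketch (using Python's sqlite3 module over a toy star schema invented for illustration) shows the kind of group-by aggregation query an OLAP roll-up typically generates:

```python
import sqlite3

# Toy star schema (invented for illustration): one fact table and one dimension.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Product (Code INTEGER PRIMARY KEY, Division TEXT);
CREATE TABLE Sales (Prod_ID INTEGER REFERENCES Product(Code), UnitsSold INTEGER);
INSERT INTO Product VALUES (1, 'Hardware'), (2, 'Hardware'), (3, 'Software');
INSERT INTO Sales VALUES (1, 10), (2, 5), (3, 7), (1, 3);
""")

# A typical roll-up: an aggregation function (SUM) applied to a fact table
# attribute, with a dimension attribute as the grouping column.
rows = con.execute("""
    SELECT P.Division, SUM(S.UnitsSold)
    FROM Sales S, Product P
    WHERE S.Prod_ID = P.Code
    GROUP BY P.Division
    ORDER BY P.Division
""").fetchall()
print(rows)  # -> [('Hardware', 18), ('Software', 7)]
```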
   A multiple fact tables (MFT) schema is a variation of the star schema, in which
there are several fact tables, necessary to represent unrelated facts, facts of different
granularity, or even to improve performance [10]. A major use of MFT schemas is to
implement a DW through a set of distributed subject-oriented DMs [8, 10], preferably
related through a set of conformed dimensions [6], i.e. dimensions that have the same
meaning at every fact table to which they can be joined. In such an architecture, a
major responsibility of the central DW design team is to establish, publish and
enforce the conformed dimensions. However, these efforts of the design team are not
enough to guarantee that end users can easily combine facts coming from more
than one DM. Indeed, the straightforward join of facts and dimensions in MFT
schemas imposes a number of restrictions that cannot always be


Y. Manolopoulos and P. Návrat (Eds): ADBIS 2002, pp. 41-51, 2002.

observed; otherwise, one risks producing incorrect results. Most users do not have the
technical skills to realise the subtleties involved and their implications in terms of
query formulation. Therefore, for most users, queries involving MFT schemas are
more easily handled through appropriate interfaces or specific applications that hide
all the difficulties involved.
   In this paper, we propose MF-Retarget, a query retargeting mechanism that handles
MFT schemas and which is additionally aggregate aware. Indeed, precomputed
aggregation is one of the most efficient performance strategies to solve queries in DW
environments [8]. The retargeting service provides users with transparency from both
the aggregate retargeting perspective (aggregate unawareness) and the multiple fact
table schema complexity perspective, freeing users from query formulation
idiosyncrasies. The algorithm is generic enough to work properly regardless of the
number of fact tables involved in the query.
   The remainder of this paper is structured as follows. Section 2 presents related
work on the use of aggregates. The retargeting algorithm is described in Section 3,
and Section 4 presents some initial performance tests. Conclusions and future work
are addressed in Section 5.


2    Related Work

2.1 Computation of Aggregates

In a DW, users are most frequently interested in some level of summarisation of the
original data. One of the most efficient strategies for handling this problem is the use
of pre-computed aggregates for the combination of dimensions/dimension attributes
providing the greatest benefit for answering queries (e.g. frequent or expensive
queries) [4, 8, 15, 16].
   The computation of aggregates can be dynamic or static. In the former case, it is up
to the OLAP tool or database engine to decide which aggregates are “beneficial”, a
concept that varies from tool to tool. Works such as [1, 2, 4, 5, 7, 14] address dynamic
computation of aggregates. These approaches differ from the static context in that not
only the cost of executing the query is considered, but also the maintenance/
reorganisation costs that arise as queries are processed [2, 5]. In
the static context, aggregates are created off-line, and therefore maintenance/
reorganisation costs are not that critical. It should be clear that dynamic and static
aggregate computations are complementary mechanisms. The first addresses
performance tuning from a technical perspective. The latter, addressed in this paper, is
essential from a corporate point of view.
   Organisationally, static aggregate computation is fundamental because aggregates
are created based on corporate decisional requirements, prioritising types of analysis
or types of users. Of course, decisional support requirements vary over time, so it is
fundamental that the DBA monitors the use of the analytical database in order to
revise the necessity of existing and/or new aggregates.
   Design alternatives for representing aggregates are extensively discussed in
pragmatic literature such as [6, 10]. Storing each aggregate in its own fact table
presents many advantages in terms of ease of manipulation, maintenance,
performance and storage requirements. Aggregation also leads to smaller
representations of dimensions, commonly referred to as shrunken dimensions.
Aggregates should, whenever possible, refer to shrunken dimensions, instead of
original dimensions. A shrunken dimension is commonly stored in a separate table
with its own primary key.
   User tools or applications should not reference the aggregate to be used in SQL
queries. First, it must be possible to include/remove aggregates without affecting
users or existing applications. Second, users should not be in charge of improving
performance by selecting the appropriate aggregate.


2.2 Aggregate Retargeting Services

There are three major options where query-retargeting services can be located in the
DW architecture: the desktop, the OLAP server or the database engine [16]. The
query retargeting service can also be located in between these layers, in case no
access to the DBMS engine/OLAP tool source code is provided. Most works in the
literature (e.g. [1, 2, 3, 4, 5, 7, 14]) focus on dynamic computation of aggregates,
considering strategies that are embedded in query processors, such that the retargeting
service can completely change the query execution plan. Dynamic aggregation also
considers a specific moment of user analysis (e.g. a sequence of related drills), and
not the organisational requirements as a whole.
    Kimball et al. [6] sketch a query-retargeting algorithm for statically pre-computed
aggregates, which can be inserted as a layer between the front-end tool and the
OLAP server/DBMS engine. The algorithm is based on the concept of a “family of
schemas”, composed of one base fact table and all of its related aggregate tables. One
of the advantages of this algorithm is that it requires very little metadata, basically the
size of each fact table and the available attributes of each aggregate. In this paper, we
extend this algorithm to deal with MFT schemas. Such an extension is useful for DW
architectures implemented as a set of subject-oriented DMs, in which users wish to
perform both separate and integrated analyses.


3     MF-Retarget

The striking feature of MF-Retarget is its ability to handle MFT schemas with the use
of aggregates, while providing total transparency for users. In the general case,
joining several fact tables requires that each individual table first be summarised to
the same summarisation level (exactly the same dimensions) before the join is
performed. However, most users do not have the technical skills to realise the
problems involved, nor the requirements in terms of query formulation. See [11] for a
deeper discussion of the subtleties involved. Additionally, the service should benefit
from the use of aggregates as a query performance tuning mechanism. Hence,
transparency in the context of MFT schemas must have a twofold meaning:
a) aggregate unawareness, and b) MFT join complexity unawareness.

   MF-Retarget is a retargeting service intended to lie between the front-end tool and
the DBMS, which accepts as input a query written by a user through a proper user
interface (e.g. a graphical one intended for naive users, a specific application). The
algorithm assumes that:
   − Users are unaware of MFT joining complexities, and always write a single
      query in terms of desired facts and dimensions. The retargeting service is
      responsible for rewriting the query to produce correct results, assuming as a
      premise that it is always necessary to bring each fact table to the same
      summarisation level before joining them.
   − Users are unaware of the existence of aggregates, and always formulate the
      query in terms of the original base tables and dimensions. The retargeting
      service is responsible for rewriting the query in terms of aggregates, if possible.
   − Retargeting queries involving a single fact table is a special case of MFT
      schemas, and therefore, the algorithm should provide good results in both cases.
   The remainder of this section presents an illustration scenario and describes the
algorithm. Further details on the algorithm, the required metadata and the
MF-Retarget prototype can be found in [11].


3.1 Algorithm Illustration

To illustrate the functioning of the algorithm, let us consider the example depicted in
Figure 1, a simplification of the MFT schema proposed in the APB-1 OLAP Council
benchmark [9]. For each fact table the fields prefixed by * compose its primary key.
In dimension tables, only one field, the lower one in the hierarchy, composes its
primary key (also prefixed by *). The branches show the referential integrity from a
fact or aggregate table for each of its dimensions. The MFT schema of Figure 1(a)
shows two fact tables (Sales and Inventory), related by three conformed dimensions:
Customer, TimeDim and Product. Sales have an additional dimension, namely
Channel. Also, Figure 1(a) shows some possible aggregates for this schema. In the
picture, grey boxes correspond to shrunken dimensions, i.e. hierarchic dimensions
without one or more lower-level fields.
   For example, consider a user who wishes to compare the quarterly Sales of
product divisions with the corresponding status of the Inventory. This query cannot
be answered simply by joining facts of distinct tables, because these facts represent
information on different granularity, and therefore, they should be brought to the
same summarisation level before they can be joined, otherwise inaccurate results will
be produced. To free the user from the difficulties involved in MFT schemas, MF-
Retarget assumes the user states a single query in terms of facts, dimensions and
desired aggregation level (the input shown in Figure 2, for the example considered).
The retargeting mechanism has then two goals: to correct the query, and try to make it
more efficient with the use of aggregates.
   Considering the aggregates illustrated in Figure 1(a), the algorithm realises that
Aggregate4 is the best candidate to answer the question, because it contains all the
necessary data, is the smallest one, and already joins the Sales and Inventory tables
(in that order). In the absence of Aggregate4, Aggregate1 and Aggregate2 will be used.
If the algorithm does not find any aggregate that can answer the query in a more
[Figure 1(a) shows the MFT schema and possible aggregates. The fact table Sales
(*Cust_ID, *Prod_ID, *Chan_ID, *Time_ID; UnitsSold, DollarSales) and the fact
table Inventory (*Cust_ID, *Prod_ID, *Time_ID; StockUnits) share the conformed
dimensions Customer (Retailer, *Store), TimeDim (Year, Quarter, *Month) and
Product (Division, Line, Family, Group, Class, *Code); Sales additionally references
the Channel dimension (*Base). The aggregates are Aggregate1 (*Cust_ID,
*Prod_ID, *Chan_ID, *Time_ID; UnitsSold, DollarSales), Aggregate2 (*Cust_ID,
*Prod_ID, *Time_ID; StockUnits), Aggregate3 (*Prod_ID, *Time_ID; StockUnits)
and Aggregate4 (*Prod_ID, *Time_ID; UnitsSold, DollarSales, StockUnits), which
reference the shrunken dimensions (grey boxes) Sh_Customer (*Retailer),
Sh_TimeDim (Year, *Quarter) and Sh_Product (Division, *Line). Figure 1(b) shows
the schema derivation graph: Sales derives Aggregate1; Inventory derives Aggregate2
and Aggregate3; Aggregate1 and Aggregate2 derive Aggregate4.]

              Fig. 1. MFT schema and possible aggregates, and schema derivation graph


efficient way, at least it transforms the query to produce correct results. Figure 3
shows the results from the algorithm for these three situations.
   It should be pointed out that the best aggregate is not always the one that already
joins distinct fact tables. Indeed, in case smaller individual aggregates exist, the cost
of joining them can be smaller than the cost of summarising a much bigger joined pre-
computed aggregate.


3.2 The Algorithm

The algorithm assumes that users have to inform only the tables (fact/dimensions), the
grouping columns (which are the same ones listed in the select clause), the
summarisation functions applied to the measurements, and possibly additional
restrictions in the where clause. It considers the following restrictions to input
queries: a) monoblock queries (select from where group by); b) only transitive
aggregation functions are used; c) all dimensions listed in the from clause apply to all
fact tables listed.
   For the algorithm, the relationship between schemas is represented by a directed
acyclic graph G(V, E), where V represents a set of star schemas and E corresponds to
a set of derivation relationships between pairs of schemas. The edges of E form
derivation paths, meaning that the schema at the end of any path can be derived by
aggregating the schema at the start of that path. The use of graph structures for
representing relationships between aggregates is well known [4, 13].
Figure 1(b) presents the derivation graph for the example of Figure 1(a). We assume
that only transitive aggregation functions (i.e. SUM, MAX and MIN) are used in both
the derivation relationships and queries.

          “The units sold and units in stock, per quarter and product division”
Select P.Division Division, T.Quarter Quarter, SUM(S.UnitsSold) UnitsSold,
SUM(I.StockUnits) StockUnits From TimeDim T, Product P, Sales S, Inventory I
Where T.Month=S.Time_ID and P.Code=S.Prod_ID and T.Month=I.Time_ID
   and P.Code=I.Prod_ID Group by P.Division, T.Quarter

                       Fig. 2. Input SQL query from a naive DW user

a) considering the existence of Aggregate4:
Select P.Division Division, T.Quarter Quarter, SUM (UnitsSold) UnitsSold,
SUM (StockUnits) StockUnits From Sh_TimeDim T, Sh_Product P, Aggregate4 A4
Where T.Quarter=A4.Time_ID and P.Line=A4.Prod_ID Group by P.Division, T.Quarter

b) in the absence of Aggregate4:
Create view V1 (Division, Quarter, UnitsSold) as Select P.Division, T.Quarter,
SUM (S.UnitsSold) From Sh_TimeDim T, Product P, Aggregate1 A1
Where T.Quarter=A1.Time_ID and P.Code=A1.Prod_ID Group by P.Division, T.Quarter

Create view V2 (Division, Quarter, StockUnits) as Select P.Division, T.Quarter,
SUM (I.StockUnits) From Sh_TimeDim T, Product P, Aggregate2 A2
Where T.Quarter=A2.Time_ID and P.Code=A2.Prod_ID Group by P.Division, T.Quarter
Select V1.Division Division, V1.Quarter Quarter, UnitsSold, StockUnits
From V1, V2 Where V1.Division = V2.Division and V1.Quarter = V2.Quarter

c) if no aggregates are found:
Create view V1 (Division, Quarter, UnitsSold) as
Select P.Division, T.Quarter, SUM (S.UnitsSold) From TimeDim T, Product P, Sales S
Where T.Month=S.Time_ID and P.Code=S.Prod_ID Group by P.Division, T.Quarter

Create view V2 (Division, Quarter, StockUnits) as Select P.Division, T.Quarter,
SUM (I.StockUnits) From TimeDim T, Product P, Inventory I
Where T.Month=I.Time_ID and P.Code= I.Prod_ID Group by P.Division, T.Quarter

Select V1.Division Division, V1.Quarter Quarter, UnitsSold, StockUnits
From V1, V2 Where V1.Division = V2.Division and V1.Quarter = V2.Quarter

                          Fig. 3. Possible outputs of the algorithm

   The algorithm is divided into four steps, which for clarity are individually
presented and illustrated using the example of Section 3.1:
   1. Divide the original query into component queries;
   2. For each component query, select candidate schema(s) for answering it;
   3. Select the best candidates;
   4. Rewrite the query.

Step 1: Division into Component Queries.
For each fact table Fi listed in the from clause of the original query Q, a component
query Ci (i>0) is created, according to the following algorithm:
1. For each fact table Fi listed in the from clause of Q, create a component query
    Ci such that:
    1.1. Ci from clause := Fi and all dimensions listed in the from clause of Q;
    1.2. Ci where clause := all join conditions of Q necessary to relate Fi to the
         dimensions, together with any additional conditions involving these
         dimensions or Fi;
    1.3. Ci group by clause := all attributes used in the group by clause of Q;
    1.4. Ci select clause := all attributes used in the group by clause of Q, in
         addition to all aggregation function(s) applied to Fi attributes.
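As a sketch, Step 1 can be rendered over a structural representation of the input query. The dict layout and names below are our own invention for illustration; MF-Retarget itself works on SQL text:

```python
# Hypothetical structural form of the monoblock query of Figure 2.
query = {
    "facts": ["Sales", "Inventory"],
    "dims": ["TimeDim", "Product"],
    "group_by": ["P.Division", "T.Quarter"],
    "measures": {"Sales": ["SUM(S.UnitsSold)"],
                 "Inventory": ["SUM(I.StockUnits)"]},
    "joins": {"Sales": ["T.Month=S.Time_ID", "P.Code=S.Prod_ID"],
              "Inventory": ["T.Month=I.Time_ID", "P.Code=I.Prod_ID"]},
}

def split_into_components(q):
    """Step 1: create one component query per fact table, keeping all the
    dimensions, the join conditions that involve that fact table, the full
    group-by list, and only that table's aggregation functions."""
    components = []
    for fact in q["facts"]:
        components.append({
            "from": [fact] + q["dims"],
            "where": q["joins"][fact],
            "group_by": q["group_by"],
            "select": q["group_by"] + q["measures"][fact],
        })
    return components

c1, c2 = split_into_components(query)
print(c1["from"])    # -> ['Sales', 'TimeDim', 'Product']
print(c2["select"])  # -> ['P.Division', 'T.Quarter', 'SUM(I.StockUnits)']
```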

Notice that a query referring to a single fact table is treated as a special case of
queries involving several fact tables. In that case, Step 1 produces a single component query
that is equal to the original query Q. Figure 4 shows the component queries created
for the query input illustrated in Figure 2.

C1 -> Select P.Division, T.Quarter, SUM (S.UnitsSold) UnitsSold
From TimeDim T, Product P, Sales S
Where T.Month=S.Time_ID and P.Code=S.Prod_ID Group by P.Division, T.Quarter
C2 -> Select P.Division, T.Quarter, SUM (I.StockUnits) StockUnits
From TimeDim T, Product P, Inventory I
Where T.Month=I.Time_ID and P.Code=I.Prod_ID Group by P.Division, T.Quarter

                      Fig. 4. Component queries for input of Figure 2


Step 2: Candidates for Component Queries
This step generates for each component query Ci (i>0) resulting from Step 1 the
respective candidate set CSi. Each candidate belonging to CSi is a schema (base or
aggregate) that answers Ci.
2    For each component query Ci, generated in Step 1:
     2.1. Let n:= the node that corresponds to the base schema of Ci; mark n as
          “visited”; let CSi := n;
     2.2. Using a depth-first traversal, examine all schemas derived from n until all
          nodes that can be reached from it are marked as “visited”;
         2.2.1. Let n := next node; mark n as “visited”;
         2.2.2. If all query attributes (select and where clauses) of Ci belong to
                schema n and each aggregation function of the select clause of Ci is
                exactly the same as one used in the fact table of schema n
                 Then CSi := CSi ∪ n;
                 Else mark all nodes that can be reached from n as “visited”.

   Each time this step is executed for a component query, the graph is traversed using
a depth-first algorithm starting from the corresponding base schema. When the
algorithm detects that a schema cannot answer the component query, all schemas at
the end of a derivation path starting from it are disregarded. Each CSi element is a
valid candidate to replace Fi in the from clause of Q. Therefore, every tuple (e1, …,
en), where e1 ∈ CS1, …, en ∈ CSn, is a valid combination for the rewrite of Q.
Considering the graph depicted in Figure 1(b), and the component queries of Figure 4:
   − CS1 = {Sales, Aggregate1, Aggregate4} for component query C1;
   − CS2 = {Inventory, Aggregate2, Aggregate3, Aggregate4} for C2.
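Under the same simplified representation, Step 2 can be sketched as a pruned depth-first traversal of the derivation graph. The attribute sets below are abridged from Figure 1 to the fields relevant to the running example, and the aggregation-function compatibility check of step 2.2.2 is omitted for brevity:

```python
# Derivation graph of Figure 1(b): an edge u -> v means v derives from u.
GRAPH = {
    "Sales": ["Aggregate1"], "Inventory": ["Aggregate2", "Aggregate3"],
    "Aggregate1": ["Aggregate4"], "Aggregate2": ["Aggregate4"],
    "Aggregate3": [], "Aggregate4": [],
}
# Attributes reachable from each schema (abridged for illustration).
ATTRS = {
    "Sales": {"Division", "Quarter", "Month", "UnitsSold", "DollarSales"},
    "Aggregate1": {"Division", "Quarter", "UnitsSold", "DollarSales"},
    "Aggregate4": {"Division", "Quarter", "UnitsSold", "DollarSales", "StockUnits"},
    "Inventory": {"Division", "Quarter", "Month", "StockUnits"},
    "Aggregate2": {"Division", "Quarter", "StockUnits"},
    "Aggregate3": {"Division", "Quarter", "Month", "StockUnits"},
}

def candidates(base, needed):
    """Step 2: DFS from the component query's base schema; a schema that
    cannot answer the query prunes its whole derivation subtree."""
    cs, visited, stack = [], set(), [base]
    while stack:
        n = stack.pop()
        if n in visited:
            continue
        visited.add(n)
        if needed <= ATTRS[n]:       # schema holds all required attributes
            cs.append(n)
            stack.extend(GRAPH[n])   # keep exploring schemas derived from n
        else:                        # prune everything derivable from n
            sub = list(GRAPH[n])
            while sub:
                m = sub.pop()
                if m not in visited:
                    visited.add(m)
                    sub.extend(GRAPH[m])
    return cs

cs1 = candidates("Sales", {"Division", "Quarter", "UnitsSold"})
cs2 = candidates("Inventory", {"Division", "Quarter", "StockUnits"})
print(sorted(cs1))  # -> ['Aggregate1', 'Aggregate4', 'Sales']
print(sorted(cs2))  # -> ['Aggregate2', 'Aggregate3', 'Aggregate4', 'Inventory']
```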

Step 3: Selection of Best Candidates
Let T be the set of tuples (e1, …, en), where e1 ∈ CS1, …, en ∈ CSn (n>0), representing
the Cartesian product of the candidate sets CS1 × … × CSn, and let t be a tuple of T.
The present version of the algorithm chooses the best candidate based on the concept
of accumulated size. The accumulated size of t(e1, …, en), AS(t), is a function that
returns the number of records that must be handled if the query were rewritten using
t. When summing the number of records, AS(t) computes only once the size of a given
table, in case it is included in more than one candidate set CSi. Thus, if Aggregate4 is
chosen only its records need to be processed, and only once. In all other cases, records
from the different fact tables in t are processed, considering each table only once.
   This may suggest that, in a multi-fact query, the best t will always be the one where
e1 = … = en. However, this is not true: the cost of processing more records
(I/O cost) has a stronger impact than the cost of joining tables. Notice that this step
can be improved in many ways, by varying the cost function used to prioritise the
candidates for query rewrite. An immediate improvement of this function is to
combine index information with table size.

3    Consider all CSi sets generated in Step 2 and T, the Cartesian product of the
     candidate sets CS1 × … × CSn:
     3.1. t' := the tuple t(e1, …, en) ∈ T with the smallest accumulated size AS(t),
          considering all t(e1, …, en) ∈ T, where e1 ∈ CS1, …, en ∈ CSn.
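Step 3 can be sketched as follows. The record counts below are invented for illustration (MF-Retarget reads the real sizes from its metadata) and are chosen to show the point made above: although AS(t) counts a shared joined aggregate only once, a pair of small individual aggregates can still be cheaper:

```python
from itertools import product

# Record counts, invented for illustration.
SIZE = {"Sales": 1_239_300, "Inventory": 270_000,
        "Aggregate1": 210_000, "Aggregate2": 60_000, "Aggregate4": 500_000}

def accumulated_size(t):
    """AS(t): records to handle for rewrite tuple t, counting a table only
    once even when it answers several component queries."""
    return sum(SIZE[e] for e in set(t))

cs1 = ["Sales", "Aggregate1", "Aggregate4"]      # candidates for C1
cs2 = ["Inventory", "Aggregate2", "Aggregate4"]  # candidates for C2 (abridged)

# Pick the tuple of T = CS1 x CS2 with the smallest accumulated size.
best = min(product(cs1, cs2), key=accumulated_size)
print(best, accumulated_size(best))  # -> ('Aggregate1', 'Aggregate2') 270000
print(accumulated_size(("Aggregate4", "Aggregate4")))  # -> 500000 (counted once)
```

With these sizes, joining the two small individual aggregates beats summarising the bigger pre-joined Aggregate4, exactly the situation discussed in Section 3.1.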

Step 4: Query Reformulation
Once the best candidate for each component query is determined, the query is
rewritten. If the set of best candidates resulting from Step 3 has a single element, i.e.
a common aggregate for all component queries, a single query is written using such
aggregate and respective shrunken dimensions. This is the case for our example,
where Figure 3(a) displays the rewritten query. Otherwise, the query is rewritten in
terms of views that summarise the best aggregates individually, and then join them
(e.g. as in Figure 3(b) and (c)). If there is a common best candidate that answers more
than one component query, but not all of them, a single view is created for that set of
component queries. This trivial algorithm is not presented due to space limitations.
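The decision above can be sketched as follows; this is a simplified rendering of our own, since the real rewriting also substitutes shrunken dimensions and adjusts join keys, as in Figure 3:

```python
def rewrite(best):
    """Step 4 (simplified): emit a single query if one common candidate
    answers all component queries; otherwise, emit one summarising view per
    candidate and then join the views on the common grouping columns."""
    if len(set(best)) == 1:
        return [f"single query over {best[0]}"]
    plan = [f"view V{i + 1} summarising {e}" for i, e in enumerate(best)]
    plan.append("join " + ", ".join(f"V{i + 1}" for i in range(len(best))) +
                " on the common grouping columns")
    return plan

print(rewrite(("Aggregate4", "Aggregate4")))
# -> ['single query over Aggregate4']
print(rewrite(("Aggregate1", "Aggregate2")))
# -> ['view V1 summarising Aggregate1', 'view V2 summarising Aggregate2',
#     'join V1, V2 on the common grouping columns']
```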


4    Tests
Initial performance tests were performed based on the MFT schema presented in the
APB-1 OLAP Benchmark [9], which comprises 4 fact tables. For the tests, we used
Inventory, Sales and corresponding dimensions, as depicted in Figure 1(a). The
aggregates were defined to experiment performance under different levels of
aggregation (compression factor). We did not use the semantics of the aggregates, nor
user requirements expressed in the benchmark (e.g. queries). We also disregarded the
number of records of the resulting database for aggregate selection. Two tests were
executed, referred to as Test1 and Test2, described in the remainder of this section.

4.1 Test1
The goal of Test1 was to verify whether the algorithm performed well and correctly,
considering both star and MFT schemas. The APB-1 program was executed with
parameters 10, 0.1 and 10, which resulted in a Sales table with 1,239,300 records,
whereas Inventory comprised 270,000 records. We executed five queries, three of
them involving a single fact table (Sales), and two of them the join of both fact tables.
For each query, we calculated the number of records of the resulting table, and the
processing time considering all possible alternatives.

   It was possible to verify that the algorithm always chose the aggregate with the
smallest processing time, regardless of whether the query involved a single fact table
or a join of multiple fact tables. Proportionally, there was a significant gain in the vast
majority of cases, but absolute gains were not always significant. The magnitude of
performance gain seems to be a function of the (fact) table size, aggregate
compression factor, and output table size.

4.2 Test2
Test2 was executed running APB-1 with parameters 10, 1 and 10. The derivation
graph for this test is depicted in Figure 5, and Table 1 describes the properties of the
schemas: number of records, compression factor (CF) with regard to both the base
fact table and its deriving schema(s), and the difference between the derived/
deriving schemas in terms of dimensions.

[Figure 5 depicts the derivation graph used in Test2: Inventory derives I1, which in
turn derives I3; Sales derives S1, which in turn derives S3; I1 and S1 jointly derive
I2S2.]

                   Fig. 5. Derivation graph used in Test2
   We executed a single query that
involved facts from both Sales and Inventory tables, and which could be answered by
(the combination of) all aggregates. The goal was to compare the respective absolute
processing times. Table 2 displays in the first row the elapsed time measured for each
execution. Subsequent rows show the gains in using an aggregate with regard to
bigger alternatives. For instance, the use of aggregate I2S2 represents a gain of 52%
with respect to the join of I1 and S1, calculated as (time(I1 and S1) –
time(I2S2))/time(I1 and S1), and 96% with regard to the join of Inventory and
Sales. It is possible to verify that the gains in absolute times are very significant this
time. The use of the accumulated size AS(t) as the main criterion to prioritise
aggregate candidates seems simple but efficient, although it can still be
improved in many ways, particularly by considering indexing.
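The gain percentages follow directly from the elapsed times reported in Table 2, using the formula stated above; for instance:

```python
def seconds(hms):
    """Parse an 'hours:min:sec' elapsed time into seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

# Elapsed times copied from Table 2.
TIMES = {"Inventory and Sales": "20:35:43", "I1 and S1": "1:27:04",
         "I2S2": "0:42:09", "I3 and S3": "0:15:22"}

def gain(faster, slower):
    """Gain of the faster alternative over the slower one, computed as
    (time(slower) - time(faster)) / time(slower), in percent."""
    t_f, t_s = seconds(TIMES[faster]), seconds(TIMES[slower])
    return round(100 * (t_s - t_f) / t_s)

print(gain("I2S2", "I1 and S1"))       # -> 52
print(gain("I3 and S3", "I2S2"))       # -> 64
print(gain("I3 and S3", "I1 and S1"))  # -> 82
```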


5     Conclusions

In this work we presented MF-Retarget, a retargeting mechanism that deals with both
MFT schemas and statically computed aggregates. The algorithm provides two types
of transparency:
     a) aggregate unawareness, and
     b) users are spared from the complexities of queries in MFT schemas.

   This retargeting service is intended to be implemented as a layer between the user
front-end tool and the DBMS engine. Thus, it can complement the gains already
provided by OLAP tools/DBMS engines in the context of dynamic computation of
aggregates. Further details on the implementation can be found in [11].

                         Tab. 1. Inventory/Sales derivation graph description

     Table or        Records       CF (base     Deriving     CF (deriving   Shrunken   Eliminated
     aggregate                     schema)      schema       schema)        dims.      dims.
     Sales (S)       13,122,000
     S1               2,400,948    18.3 %       Sales        18.3 %         Yes        Yes
     S3                 614,520     4.7 %       S1           25.6 %         Yes
     Inventory (I)   12,396,150
     I1               2,496,762    20.1 %       Inventory    20.1 %         Yes
     I3                 631,800     5.1 %       I1           25.3 %         Yes
     I2S2             2,400,948    18.3 % (S)   S1 and I1    96.2 % (I1)
                                   19.4 % (I)                100 % (S1)


                                Tab. 2. Results with larger tables

     Query1: 614,520 resulting records
                                   Inventory and Sales   I1 and S1   I2S2        I3 and S3
     Time (hours:min:sec)          20:35:43              1:27:04     0:42:09     0:15:22
     Gain over Inventory/Sales (%)                       93          96          99
     Gain over I1 and S1 (%)                                         52          82
     Gain over I2S2 (%)                                                          64
     AS(t)                         25,518,150            4,897,710   2,400,948   1,246,320



   Preliminary tests confirmed that the algorithm always provided the best response
time. Proportional gains are always significant, but absolute gains increase with
bigger fact tables. Additional tests are required to determine precise directives for the
construction of aggregates in MFT schemas, and to establish under which
circumstances the processing gains are significant. It is also important to refine our
criteria for selecting the best candidate, and to use indexing information in addition
to the number of table records for aggregate selection.
   Future work also includes, among other topics, a better definition of the cost
functions for prioritising candidate aggregates, the use of indexes in the cost function,
the integration of the retargeting mechanism into a DW architecture, support for
monitoring aggregates and recommending their reorganisation, and the use of the
proposed algorithm in the context of dynamic aggregate computation.


Acknowledgements

This work was partially financed by FAPERGS, Brazil.


References
1. Baralis, E., Paraboschi, S., Teniente, E. Materialized Views Selection in a Multidimensional
   Database. Proceedings of VLDB'97 (1997) 156-165.
2. Chaudhuri, S., Shim, K. An Overview of Cost-Based Optimization of Queries with
   Aggregates. Bulletin of the TCDE (IEEE), v. 18, n. 3 (1995) 3-9.
3. Gray, J., Chaudhuri, S., et al. Data Cube: a Relational Aggregation Operator Generalizing
   Group-by, Cross-tab and Subtotals. Data Mining and Knowledge Discovery, v. 1, n. 1
   (1997) 29-53.
4. Gupta, A., Harinarayan, V., Quass, D. Aggregate-query Processing in Data Warehousing
   Environments. Proceedings of VLDB'95 (1995) 358-369.
5. Gupta, H., Mumick, I. Selection of Views to Materialize under a Maintenance Cost
   Constraint. Proceedings of ICDT (1999) 453-470.
6. Kimball, R., et al. The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing,
   Developing, and Deploying Data Warehouses. John Wiley & Sons (1998).
7. Kotidis, Y., Roussopoulos, N. DynaMat: A Dynamic View Management System for Data
   Warehouses. Proceedings of ACM SIGMOD 1999 (1999) 371-382.
8. Meredith, M., Khader, A. Divide and Aggregate: Designing Large Warehouses. Database
   Programming & Design, June (1996) 24-30.
9. OLAP Council. APB-1 OLAP Benchmark (Release II). Nov. 1998. Available at
   http://www.olapcouncil.org/research/bmarkco.htm (captured May 2000).
10. Poe, V., Klauer, P., Brobst, S. Building a Data Warehouse for Decision Support, 2nd
   edition. Prentice Hall (1998).
11. Santos, K. MF-Retarget: a Multiple-Fact Table Aggregate Retargeting Mechanism. MSc
   Dissertation, Faculdade de Informática, PUCRS, Brazil (2000) (in Portuguese).
12. Sapia, C. On Modeling and Predicting Query Behavior in OLAP Systems. Proceedings of
   DMDW'99 (1999).
13. Sarawagi, S., Agrawal, R., Gupta, A. On Computing the Data Cube. Technical Report,
   IBM Almaden Research Center, San Jose, CA (1996).
14. Srivastava, D., et al. Answering Queries with Aggregation Using Views. Proceedings of
   VLDB'96 (1996) 318-329.
15. Wedekind, H., et al. Preaggregation in Multidimensional Data Warehouse Environments.
   Proceedings of ISDSS'97 (1997) 581-59.
16. Winter, R. Be Aggregate Aware. Intelligent Enterprise Magazine, v. 2, n. 13 (1999).

Aggreagate awareness

  • 1. MF-Retarget: Aggregate Awareness in Multiple Fact Table Schema Data Warehouses Karin Becker, Duncan Dubugras Ruiz, and Kellyne Santos Faculdade de Informática – Pontifícia Universidade Católica do Rio Grande do Sul http://www.inf.pucrs.br/{~kbecker | ~duncan} {kbecker, duncan} @inf.pucrs.br, kellyne@ufs.br Abstract. Performance is a critical issue in Data Warehouse systems (DWs), due to the large amounts of data manipulated, and the type of analysis performed. A common technique used to improve performance is the use of pre-computed aggregate data, but the use of aggregates must be transparent for DW users. In this work, we present MF-Retarget, a query retargeting mechanism that deals with both conventional star schemas and multiple fact table (MFT) schemas. This type of multidimensional schema is often used to implement a DW using distinct, but interrelated Data Marts. The paper presents the retargeting algorithm and initial performance tests. 1 Introduction Data warehouses (DW) are analytical databases aimed at providing intuitive access to information useful for decision-making processes. A Data Mart (DM), often referred to as a subject-oriented DW, represents a subset of the DW, comprised of relevant data for a particular business function (e.g. marketing, sales). DW/DM handle large volumes of data, and they are often designed using a star schema, which contains relatively few tables and well-defined join paths. On-line Analytical Processing (OLAP) systems are the predominant front-end tools used in DW environments, which typically explore this multidimensional data structure [3, 13]. OLAP operations (e.g. drill down, roll up, slice and dice) typically result in SQL queries in which aggregation functions (e.g. SUM, COUNT) are applied to fact table attributes, using dimension table attributes as grouping columns (group by clause). 
A multiple fact tables (MFT) schema is a variation of the star schema, in which there are several fact tables, necessary to represent unrelated facts, facts of different granularity, or even to improve performance [10]. A major use of MFT schemas is to implement a DW through a set of distributed subject-oriented DMs [8, 10], preferably related through a set of conformed dimensions [6], i.e. dimensions that have the same meaning at every possible fact table to which it can be joined. In such architecture, a major responsibility of the central DW design team is to establish, publish and enforce the conformed dimensions. However, these efforts of the design team are not enough to guarantee the ease combination, by end users, of facts coming from more than one DM. Indeed, the straightforward join of facts and dimensions in MFT schemas imposes a number of restrictions, which are not always possible to be Y. Manolopoulos and P. Návrat (Eds): ADBIS 2002, pp. 41-51, 2002.
  • 2. 42 Karin Becker et al. observed, otherwise, one risks to produce incorrect results. Most users do not have the technical skills for realising the involved subtleties and their implications in terms of query formulation. Therefore, for most users, queries involving MFT schemas are more easily handled through appropriate interfaces or specific applications that hide from them all difficulties involved. In this paper, we propose MF-Retarget, a query retargeting mechanism that handles MFT schemas and which is additionally aggregate aware. Indeed, precomputed aggregation is one of the most efficient performance strategies to solve queries in DW environments [8]. The retargeting service provides users with transparency from both an aggregate retargeting perspective (aggregate unawareness) and multiple fact tables’ schema complexity perspective, freeing users from query formulation idiosyncrasies. The algorithm is generic to work properly regardless the number of fact tables involved in the query. The remainder of this paper is structured as follows. Section 2 presents related work on the use of aggregates. The retargeting algorithm is described in Section 3, and Section 4 presents some initial performance tests. Conclusions and future work are addressed in Section 5. 2 Related Work 2.1 Computation of Aggregates In a DW, most frequently users are interested in some level of summarisation of original data. One of the most efficient strategies for handling this problem is the use of pre-computed aggregates for the combination of dimensions/dimension attributes providing the greatest benefit for answering queries (e.g. frequent or expensive queries) [4, 8,15, 16]. The computation of aggregates can be dynamic or static. In the former case, it is up to the OLAP tool or database engine to decide which aggregates are “beneficial”, a concept that varies from tool to tool. Works such as [1, 2, 4, 5, 7, 14] address dynamic computation of aggregates. 
These approaches are different from the static context in that not only the cost of executing the query is considered, but also maintenance/reorganisation costs which takes place as query is processed [2, 5]. In the static context, aggregates are created off-line, and therefore maintenance/ reorganisation costs are not that critical. It should be clear that dynamic and static aggregate computations are complementary mechanisms. The first addresses performance tuning from a technical perspective. The latter, addressed in this paper, is essential from a corporate point of view. Organisationally, static aggregate computation is fundamental because aggregates are created based on corporate decisional requirements, prioritising types of analysis or types of users. Of course, decisional support requirements vary overtime, so it is fundamental that the DBA monitors the use of the analytical database in order to revise the necessity of existing and/or new aggregates. Design alternatives for representing aggregates are extensively discussed in pragmatic literature such as [6, 10]. Storing each aggregate in its own fact table
  • 3. MF-Retarget: Aggregate Awareness in Multiple Fact Table Schema Data Warehouses 43 presents many advantages in terms of easiness of manipulation, maintenance, performance and storage requirements. Aggregation also leads to smaller representations of dimensions, commonly referred to as shrunken dimensions. Aggregates should, whenever possible, refer to shrunken dimensions, instead of original dimensions. A shrunken dimension is commonly stored in a separate table with its own primary key. User tools or applications should not reference the aggregate to be used in SQL queries. First, it must be possible to include/remove aggregates without affecting users or existing applications. Second, users cannot be in charge of performance improvement by the selection of the appropriate aggregate. 2.2 Aggregate Retargeting Services There are three major options where query-retargeting services can be located in the DW architecture: the desktop, the OLAP server or the database engine [16]. The query retargeting service can also be located in between these layers, in case no access to the DBMS engine/OLAP tool source code is provided. Most works in the literature (e.g. [1, 2, 3, 4, 5, 7, 14]) focus on dynamic computation of aggregates, considering strategies that are embedded in query processors, such that the retargeting service can change completely the query execution plan. Dynamic aggregation also considers a specific moment of user analysis (e.g. a sequence of related drills), and not the organisational requirements as a whole. Kimball et al. [6] sketch a query-retargeting algorithm for statically pre-computed aggregates, which could be inserted as a layer between front-end toll and OLAP/server DBMS engine. The algorithm is based on the concept of “family of schemas”, composed of one base fact table and all of its related aggregate tables. 
One of the advantages of this algorithm is that it requires very little metadata: basically the size of each fact table and the available attributes of each aggregate. In this paper, we extend this algorithm to deal with MFT schemas. Such an extension is useful for DW architectures implemented as a set of subject-oriented DMs, in which users wish to perform both separate and integrated analyses.

3 MF-Retarget

The striking feature of MF-Retarget is its ability to handle MFT schemas with the use of aggregates, while providing total transparency for users. Joining several fact tables requires, in the general case, that each individual table first be summarised individually until all tables are at the same summarisation level (exactly the same dimensions), and only then joined. However, most users have neither the technical skills to realise the problems involved, nor to meet the requirements in terms of query formulation. See [11] for a deeper discussion of the subtleties involved. Additionally, users should benefit from the use of aggregates as a query performance tuning mechanism. Hence, transparency in the context of the MFT schemas considered must have a twofold meaning: a) aggregate unawareness, and b) MFT join complexity unawareness.
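Why joining fact tables at different summarisation levels produces wrong numbers can be seen on a toy example (the data below is invented purely for illustration): joining before grouping pairs each sales row with every matching inventory row, so measures are multiplied across matches.

```python
# Toy fact rows at month grain (invented data).
sales     = [("Jan", 10), ("Feb", 20), ("Mar", 30)]   # (month, units sold)
inventory = [("Jan", 5), ("Jan", 7), ("Feb", 4)]      # two Jan inventory rows

# Wrong: join first, then sum -- the Jan sales row pairs with BOTH Jan
# inventory rows, so its UnitsSold is counted twice.
joined = [(m1, u, s) for (m1, u) in sales for (m2, s) in inventory if m1 == m2]
wrong_units = sum(u for (_, u, _) in joined)          # 10 + 10 + 20 = 40

# Right: summarise each fact table to the common level first, then join.
units_by_month = {}
for m, u in sales:
    units_by_month[m] = units_by_month.get(m, 0) + u
right_units = sum(units_by_month[m] for m in units_by_month
                  if any(m2 == m for (m2, _) in inventory))   # 10 + 20 = 30
```

MF-Retarget's rewrite (summarise-then-join, as in Figure 3(b) and (c)) avoids exactly this kind of double counting.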
MF-Retarget is a retargeting service intended to lie between the front-end tool and the DBMS, which accepts as input a query written by a user through a proper user interface (e.g. a graphical one intended for naive users, or a specific application). The algorithm assumes that:
− Users are unaware of MFT joining complexities, and always write a single query in terms of the desired facts and dimensions. The retargeting service is responsible for rewriting the query to produce correct results, under the premise that it is always necessary to bring each fact table to the same summarisation level before joining them.
− Users are unaware of the existence of aggregates, and always formulate the query in terms of the original base tables and dimensions. The retargeting service is responsible for rewriting the query in terms of aggregates, if possible.
− Retargeting queries involving a single fact table is a special case of MFT schemas, and therefore the algorithm should provide good results in both cases.
The remainder of this section presents an illustration scenario and describes the algorithm. Further details on the algorithm, the required metadata and the MF-Retarget prototype can be found in [11].

3.1 Algorithm Illustration

To illustrate the functioning of the algorithm, let us consider the example depicted in Figure 1, a simplification of the MFT schema proposed in the APB-1 OLAP Council benchmark [9]. For each fact table, the fields prefixed by * compose its primary key. In dimension tables, only one field, the lowest one in the hierarchy, composes the primary key (also prefixed by *). The branches show the referential integrity from a fact or aggregate table to each of its dimensions. The MFT schema of Figure 1(a) shows two fact tables (Sales and Inventory), related by three conformed dimensions: Customer, TimeDim and Product. Sales has an additional dimension, namely Channel.
Figure 1(a) also shows some possible aggregates for this schema. In the picture, grey boxes correspond to shrunken dimensions, i.e. hierarchic dimensions without one or more lower-level fields. For example, consider a user who wishes to comparatively analyse quarterly Sales of product divisions against the corresponding status of the Inventory. This query cannot be answered simply by joining facts of distinct tables, because these facts represent information at different granularities; they should therefore be brought to the same summarisation level before they can be joined, otherwise inaccurate results will be produced. To free the user from the difficulties involved in MFT schemas, MF-Retarget assumes the user states a single query in terms of facts, dimensions and the desired aggregation level (the input shown in Figure 2, for the example considered). The retargeting mechanism then has two goals: to correct the query, and to try to make it more efficient with the use of aggregates. Considering the aggregates illustrated in Figure 1(a), the algorithm realises that Aggregate4 is the best candidate to answer the question, because it contains all necessary data, is the smallest one, and already joins the Sales and Inventory tables (in that order). In the absence of Aggregate4, Aggregate1 and Aggregate2 will be used. If the algorithm does not find any aggregate that can answer the query in a more
efficient way, it at least transforms the query to produce correct results. Figure 3 shows the results of the algorithm for these three situations.

Fig. 1. MFT schema and possible aggregates (a), and schema derivation graph (b)

It should be pointed out that the best aggregate is not always the one that already joins distinct fact tables. Indeed, in case smaller individual aggregates exist, the cost of joining them can be smaller than the cost of summarising a much bigger joined pre-computed aggregate.

3.2 The Algorithm

The algorithm assumes that users inform only the tables (fact/dimensions), the grouping columns (which are the same ones listed in the select clause), the summarisation functions applied to the measures, and possibly additional restrictions in the where clause. It places the following restrictions on input queries: a) monoblock queries (select from where group by); b) only transitive aggregation functions are used; c) all dimensions listed in the from clause apply to all fact tables listed. For the algorithm, the relationship between schemas is represented by a directed acyclic graph G(V, E). In the graph, V represents a set of star schemas, and E corresponds to the set of derivation relationships between any two schemas.
The edges of E form derivation paths, meaning that the schema at the end of any path can be derived by aggregation of the schema at the start of that path. The use of graph structures for representing relationships between aggregates is well known [4, 13]. Figure 1(b) presents the derivation graph for the example of Figure 1(a). We assume that only transitive aggregation functions (i.e. SUM, MAX and MIN) are used in both the derivation relationships and the queries.
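The derivation graph G(V, E) can be encoded as a plain adjacency list; a minimal sketch in Python, where the edges are our reading of Figure 1(b):

```python
# Edge A -> B: schema B can be derived by aggregating schema A
# (edges are our reading of Figure 1(b)).
DERIVES = {
    "Sales":      ["Aggregate1"],
    "Inventory":  ["Aggregate2", "Aggregate3"],
    "Aggregate1": ["Aggregate4"],
    "Aggregate2": ["Aggregate4"],
    "Aggregate3": [],
    "Aggregate4": [],
}

def derivable_from(start):
    """All schemas reachable from `start` along derivation paths."""
    seen, stack = set(), list(DERIVES[start])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(DERIVES[node])
    return seen
```

Note that the reachable sets from Sales and Inventory match the candidate sets CS1 and CS2 computed for the example (plus the base tables themselves).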
"The units sold and units in stock, per quarter and product division"

Select P.Division Division, T.Quarter Quarter,
       SUM(S.UnitsSold) UnitsSold, SUM(I.StockUnits) StockUnits
From TimeDim T, Product P, Sales S, Inventory I
Where T.Month=S.Time_ID and P.Code=S.Prod_ID
  and T.Month=I.Time_ID and P.Code=I.Prod_ID
Group by P.Division, T.Quarter

Fig. 2. Input SQL query from a naive DW user

a) considering the existence of Aggregate4:

Select P.Division Division, T.Quarter Quarter,
       SUM(UnitsSold) UnitsSold, SUM(StockUnits) StockUnits
From Sh_TimeDim T, Sh_Product P, Aggregate4 A4
Where T.Quarter=A4.Time_ID and P.Line=A4.Prod_ID
Group by P.Division, T.Quarter

b) in the absence of Aggregate4:

Create view V1 (Division, Quarter, UnitsSold) as
  Select P.Division, T.Quarter, SUM(A1.UnitsSold)
  From Sh_TimeDim T, Product P, Aggregate1 A1
  Where T.Quarter=A1.Time_ID and P.Code=A1.Prod_ID
  Group by P.Division, T.Quarter

Create view V2 (Division, Quarter, StockUnits) as
  Select P.Division, T.Quarter, SUM(A2.StockUnits)
  From Sh_TimeDim T, Product P, Aggregate2 A2
  Where T.Quarter=A2.Time_ID and P.Code=A2.Prod_ID
  Group by P.Division, T.Quarter

Select V1.Division Division, V1.Quarter Quarter, UnitsSold, StockUnits
From V1, V2
Where V1.Division = V2.Division and V1.Quarter = V2.Quarter

c) if no aggregates are found:

Create view V1 (Division, Quarter, UnitsSold) as
  Select P.Division, T.Quarter, SUM(S.UnitsSold)
  From TimeDim T, Product P, Sales S
  Where T.Month=S.Time_ID and P.Code=S.Prod_ID
  Group by P.Division, T.Quarter

Create view V2 (Division, Quarter, StockUnits) as
  Select P.Division, T.Quarter, SUM(I.StockUnits)
  From TimeDim T, Product P, Inventory I
  Where T.Month=I.Time_ID and P.Code=I.Prod_ID
  Group by P.Division, T.Quarter

Select V1.Division Division, V1.Quarter Quarter, UnitsSold, StockUnits
From V1, V2
Where V1.Division = V2.Division and V1.Quarter = V2.Quarter

Fig. 3.
Possible outputs of the algorithm

The algorithm is divided into four steps, which for clarity are individually presented and illustrated using the example of Section 3.1:
1. Divide the original query into component queries;
2. For each component query, select candidate schema(s) for answering it;
3. Select the best candidates;
4. Rewrite the query.

Step 1: Division into Component Queries. For each fact table Fi listed in the from clause of the original query Q, a component query Ci (i>0) is created, according to the following algorithm:
1. For each fact table Fi listed in the from clause of Q, create a component query Ci such that:
   1.1. Ci from clause := Fi and all dimensions listed in the from clause of Q;
   1.2. Ci where clause := all join conditions of Q necessary to relate Fi to the dimensions, together with any additional conditions involving these dimensions or Fi;
   1.3. Ci group by clause := all attributes used in the group by clause of Q;
   1.4. Ci select clause := all attributes used in the group by clause of Q, in addition to the aggregation function(s) applied to Fi attributes.
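Step 1 can be sketched as follows; this is our own simplified model (queries as dictionaries, join conditions as (dimension, fact) pairs) rather than real SQL parsing:

```python
def split_into_components(q):
    """Step 1: one component query per fact table in q, copying the
    dimensions and group-by columns, and keeping only the joins and
    aggregations that involve that fact table."""
    components = []
    for fact in q["facts"]:
        components.append({
            "facts":    [fact],
            "dims":     list(q["dims"]),
            "group_by": list(q["group_by"]),
            # select clause: group-by attributes plus this table's aggregations
            "select":   list(q["group_by"]) +
                        [a for a in q["aggs"] if a[1] == fact],
            # where clause: the join conditions relating this table to the dims
            "where":    [j for j in q["joins"] if j[1] == fact],
        })
    return components

# The input query of Figure 2, in this simplified form:
Q = {
    "facts":    ["Sales", "Inventory"],
    "dims":     ["TimeDim", "Product"],
    "group_by": ["Division", "Quarter"],
    "aggs":     [("SUM(UnitsSold)", "Sales"), ("SUM(StockUnits)", "Inventory")],
    "joins":    [("TimeDim", "Sales"), ("Product", "Sales"),
                 ("TimeDim", "Inventory"), ("Product", "Inventory")],
}
```

Applied to Q, this yields the two component queries C1 and C2 shown in Figure 4.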
Notice that a query referring to a single fact table is treated as a special case of queries involving several fact tables. In that case, Step 1 produces a single component query that is equal to the original query Q. Figure 4 shows the component queries created for the query input illustrated in Figure 2.

C1 -> Select P.Division, T.Quarter, SUM(S.UnitsSold) UnitsSold
      From TimeDim T, Product P, Sales S
      Where T.Month=S.Time_ID and P.Code=S.Prod_ID
      Group by P.Division, T.Quarter

C2 -> Select P.Division, T.Quarter, SUM(I.StockUnits) StockUnits
      From TimeDim T, Product P, Inventory I
      Where T.Month=I.Time_ID and P.Code=I.Prod_ID
      Group by P.Division, T.Quarter

Fig. 4. Component queries for the input of Figure 2

Step 2: Candidates for Component Queries. This step generates, for each component query Ci (i>0) resulting from Step 1, the respective candidate set CSi. Each candidate belonging to CSi is a schema (base or aggregate) that answers Ci.
2. For each component query Ci generated in Step 1:
   2.1. Let n := the node that corresponds to the base schema of Ci; mark n as "visited"; let CSi := {n};
   2.2. Using a depth-first traversal, examine all schemas derived from n until all nodes that can be reached from it are marked as "visited";
      2.2.1. Let n := next node; mark n as "visited";
      2.2.2. If all query attributes (select and where clauses) of Ci belong to schema n, and each aggregation function in the select clause of Ci is exactly the same as the one used in the fact table of schema n,
             Then CSi := CSi ∪ {n};
             Else mark all nodes that can be reached from n as "visited".
Each time this step is executed for a component query, the graph is traversed using a depth-first algorithm starting from the corresponding base schema. When the algorithm detects that a schema cannot answer the component query, all schemas at the end of a derivation path starting from it are disregarded.
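The pruning traversal of Step 2 can be sketched as follows (illustrative Python; the aggregation-function check of step 2.2.2 is folded into a simple attribute test, and the graph and attribute sets mirror only the Sales branch of Figure 1, in simplified form):

```python
GRAPH = {                       # Sales branch of the derivation graph
    "Sales":      ["Aggregate1"],
    "Aggregate1": ["Aggregate4"],
    "Aggregate4": [],
}
ATTRS = {                       # attributes available per schema (simplified)
    "Sales":      {"Month", "Quarter", "Code", "Division", "UnitsSold"},
    "Aggregate1": {"Quarter", "Code", "Division", "UnitsSold"},
    "Aggregate4": {"Quarter", "Division", "UnitsSold"},
}

def candidate_set(base, needed):
    """Step 2: depth-first search from the base schema; a schema that cannot
    answer the component query prunes every schema derivable from it."""
    candidates, visited = [base], {base}
    stack = list(GRAPH[base])
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        if needed <= ATTRS[node]:
            candidates.append(node)
            stack.extend(GRAPH[node])
        else:                       # prune: mark the whole subgraph visited
            sub = list(GRAPH[node])
            while sub:
                m = sub.pop()
                if m not in visited:
                    visited.add(m)
                    sub.extend(GRAPH[m])
    return candidates
```

For the quarterly/division query, all three schemas qualify; a query needing month-level detail stops at the base table, pruning both aggregates in one step.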
Each element of CSi is a valid candidate to replace Fi in the from clause of Q. Therefore, every tuple (e1, …, en), where e1 ∈ CS1, …, en ∈ CSn, is a valid combination for the rewrite of Q. Considering the graph depicted in Figure 1(b) and the component queries of Figure 4:
− CS1 = {Sales, Aggregate1, Aggregate4} for component query C1;
− CS2 = {Inventory, Aggregate2, Aggregate3, Aggregate4} for C2.

Step 3: Selection of Best Candidates. Let T be the set of tuples (e1, …, en), where e1 ∈ CS1, …, en ∈ CSn (n>0), representing the Cartesian product CS1 × … × CSn of the candidate sets, and let t be a tuple of T. The present version of the algorithm bases the choice of best candidate on the concept of accumulated size. The accumulated size of t(e1, …, en), AS(t), is a function that returns the number of records that must be handled if the query were rewritten using t. When summing the number of records, AS(t) computes only once the size of a given
table, in case it is included in more than one candidate set CSi. Thus, if Aggregate4 is chosen, only its records need to be processed, and only once. In all other cases, records from the different fact tables in t are processed, considering each table only once. This may suggest that, in a multi-fact query, the best t will always be the one where e1 = … = en. However, this is not true. Indeed, the cost of processing more records (I/O cost) has a stronger impact than the cost of joining tables. Notice that this step can be improved in many ways, by varying the cost function used to prioritise the candidates for query rewriting. An immediate improvement of this function is the consideration of index information combined with table size.
3. Consider all CSi sets generated in Step 2 and T, the Cartesian product CS1 × … × CSn of the candidate sets:
   3.1. t' := the t(e1, …, en) ∈ T with the smallest accumulated size AS(t), considering all t(e1, …, en) ∈ T, where e1 ∈ CS1, …, en ∈ CSn.

Step 4: Query Reformulation. Once the best candidate for each component query is determined, the query is rewritten. If the set of best candidates resulting from Step 3 has a single element, i.e. a common aggregate for all component queries, a single query is written using that aggregate and the respective shrunken dimensions. This is the case for our example, where Figure 3(a) displays the rewritten query. Otherwise, the query is rewritten in terms of views that summarise the best aggregates individually, and then joins them (e.g. as in Figure 3(b) and (c)). If there is a common best candidate that answers more than one component query, but not all of them, a single view is created for that set of component queries. This trivial algorithm is not presented due to space limitations.

4 Tests

Initial performance tests were performed based on the MFT schema presented in the APB-1 OLAP Benchmark [9], which comprises 4 fact tables.
For the tests, we used Inventory, Sales and the corresponding dimensions, as depicted in Figure 1(a). The aggregates were defined to exercise performance under different levels of aggregation (compression factors). We did not use the semantics of the aggregates, nor the user requirements expressed in the benchmark (e.g. queries). We also disregarded the number of records of the resulting database for aggregate selection. Two tests were executed, referred to as Test1 and Test2, described in the remainder of this section.

4.1 Test1

The goal of Test1 was to verify whether the algorithm performed well and correctly, considering both star and MFT schemas. The APB-1 program was executed with parameters 10, 0.1 and 10, which resulted in a Sales table with 1,239,300 records, whereas Inventory comprised 270,000 records. We executed five queries, three of them involving a single fact table (Sales), and two of them involving the join of both fact tables. For each query, we calculated the number of records of the resulting table, and the processing time considering all possible alternatives.
It was possible to verify that the algorithm always chose the aggregate with the smallest processing time, regardless of whether the query involved a single fact table or a join of multiple fact tables. Proportionally, there was a significant gain in the vast majority of cases, but absolute gains were not always significant. The magnitude of the performance gain seems to be a function of the (fact) table size, the aggregate compression factor, and the output table size.

4.2 Test2

Test2 was executed running APB-1 with parameters 10, 1 and 10. The derivation graph for this test is depicted in Figure 5, and Table 1 describes the properties of the schemas: number of records, compression factor (CF) with regard to both the base fact table and its deriving schema(s), and the difference between the derived/deriving schemas in terms of dimensions.

Fig. 5. Derivation graph used in Test2 (Inventory → I1 → {I2S2, I3}; Sales → S1 → {I2S2, S3})

We executed a single query that involved facts from both the Sales and Inventory tables, and which could be answered by (the combination of) all aggregates. The goal was to compare the respective absolute processing times. Table 2 displays in its first row the elapsed time measured for each execution. Subsequent rows show the gains of using an aggregate with regard to the bigger alternatives. For instance, the use of aggregate I2S2 represents a gain of 52% with respect to the join of I1 and S1, calculated as (time(I1 and S1) – time(I2S2))/time(I1 and S1), and of 96% with regard to the join of Inventory and Sales. It is possible to verify that the gains in absolute times are very significant this time. The use of the accumulated size AS(t) as the main criterion to prioritise aggregate candidates seems to be simple but effective, although it can still be improved in many ways, particularly with indexing.
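The percentages in Table 2 follow directly from gain = (time(bigger) − time(smaller)) / time(bigger); a quick check in Python (rounding conventions may differ by a percentage point from the published table):

```python
def secs(hms):
    """Convert an 'h:mm:ss' elapsed time to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

def gain_pct(bigger, smaller):
    """Relative gain of the smaller alternative over the bigger one, in %."""
    return round(100 * (secs(bigger) - secs(smaller)) / secs(bigger))

# Elapsed times from Table 2: join of I1 and S1, versus aggregate I2S2.
t_i1_s1, t_i2s2 = "1:27:04", "0:42:09"
```

Here gain_pct(t_i1_s1, t_i2s2) reproduces the 52% gain of I2S2 over the join of I1 and S1 quoted above.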
5 Conclusions

In this work we presented MF-Retarget, a retargeting mechanism that deals with both MFT schemas and statically computed aggregates. The algorithm provides two types of transparency: a) aggregate unawareness, and b) users are spared the complexities of queries over MFT schemas. The retargeting service is intended to be implemented as a layer between the user front-end tool and the DBMS engine. Thus, it can be complementary to the gains already provided by OLAP tools/DBMS engines in the context of dynamic computation of aggregates. Further details on the implementation can be obtained in [11].
Tab. 1. Inventory/Sales derivation graph description

Table or       Records     CF (base   Deriving   CF (deriving  Shrunken  Eliminated
aggregate                  schema)    schema     schema)       dims.     dims.
Sales (S)      13,122,000
S1              2,400,948  18.3%      Sales      18.3%         Yes       Yes
S3                614,520   4.7%      S1         25.6%         Yes
Inventory (I)  12,396,150
I1              2,496,762  20.1%      Inventory  20.1%         Yes
I3                631,800   5.1%      I1         25.3%         Yes
I2S2            2,400,948  18.3% (S)  S1 and I1  100% (S1)
                           19.4% (I)             96.2% (I1)

Tab. 2. Results with larger tables (Query1 result: 614,520 records)

                              Inventory and Sales  I1 and S1   I2S2       I3 and S3
Time (hours:min:sec)          20:35:43             1:27:04     0:42:09    0:15:22
Gain vs. Inventory/Sales (%)                       93          96         99
Gain vs. I1 and S1 (%)                                         52         82
Gain vs. I2S2 (%)                                                         64
AS(t)                         25,518,150           4,897,710   2,400,948  1,246,320

Preliminary tests confirmed that the algorithm always provided the best response time. Proportional gains are always significant, but absolute gains increase with bigger fact tables. Obviously, additional tests are required to determine precise directives for the construction of aggregates in MFT schemas, and to establish under which circumstances the processing gains are significant. It is also important to refine our criteria for selecting the best candidate, using indexing information in addition to the number of records of each table. Future work also includes, among other topics, a better definition of the cost functions for prioritising candidate aggregates, the integration of the retargeting mechanism into a DW architecture, support for aggregate monitoring and recommendations for aggregate reorganisation, and the use of the proposed algorithm in the context of dynamic aggregate computation.

Acknowledgements

This work was partially financed by FAPERGS, Brazil.

References

1. Baralis, E., Paraboschi, S., Teniente, E.: Materialized Views Selection in a Multidimensional Database. In: Proceedings of VLDB'97 (1997) 156-165.
2. Chaudhuri, S., Shim, K.: An Overview of Cost-Based Optimization of Queries with Aggregates. Bulletin of the TCDE (IEEE), v. 18, n. 3 (1995) 3-9.
3. Gray, J., Chaudhuri, S., et al.: Data Cube: a Relational Aggregation Operator Generalizing Group-By, Cross-Tab and Sub-Totals. Data Mining and Knowledge Discovery, v. 1, n. 1 (1997) 29-53.
4. Gupta, A., Harinarayan, V., Quass, D.: Aggregate-Query Processing in Data Warehousing Environments. In: Proceedings of VLDB'95 (1995) 358-369.
5. Gupta, H., Mumick, I.: Selection of Views to Materialize under a Maintenance Cost Constraint. In: Proceedings of ICDT (1999) 453-470.
6. Kimball, R., et al.: The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses. John Wiley & Sons (1998).
7. Kotidis, Y., Roussopoulos, N.: DynaMat: a Dynamic View Management System for Data Warehouses. In: Proceedings of ACM SIGMOD 1999 (1999) 371-382.
8. Meredith, M., Khader, A.: Divide and Aggregate: Designing Large Warehouses. Database Programming & Design, June (1996) 24-30.
9. OLAP Council: APB-1 OLAP Benchmark (Release II). Online; captured in May 2000. Available at: http://www.olapcouncil.org/research/bmarkco.htm, Nov. 1998.
10. Poe, V., Klauer, P., Brobst, S.: Building a Data Warehouse for Decision Support, 2nd edition. Prentice Hall (1998).
11. Santos, K.: MF-Retarget: a Multiple-Fact Table Aggregate Retargeting Mechanism. MSc Dissertation, Faculdade de Informática - PUCRS, Brazil (2000) (in Portuguese).
12. Sapia, C.: On Modeling and Predicting Query Behavior in OLAP Systems. In: Proceedings of DMDW'99 (1999).
13. Sarawagi, S., Agrawal, R., Gupta, A.: On Computing the Data Cube. Technical Report, IBM Almaden Research Center, San Jose, CA (1996).
14. Srivastava, D., et al.: Answering Queries with Aggregation Using Views. In: Proceedings of VLDB'96 (1996) 318-329.
15. Wedekind, H., et al.: Preaggregation in Multidimensional Data Warehouse Environments. In: Proceedings of ISDSS'97 (1997) 581-59.
16. Winter, R.: Be Aggregate Aware. Intelligent Enterprise Magazine, v. 2, n. 13 (1999).