Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
ata Warehouse and Business Intelligence - Recipe 7 - A messaging system for Oracle Data Warehouse (part 1)
1. Recipes of
Data Warehouse
and
Business Intelligence
Massimo Cenci Data Warehouse Blog
http://massimocenci.blogspot.it
A messaging
system for Oracle
Data Warehouse
(part 1)
Micro ETL Foundation
2. Introduction (1)
• A goal of what I have called the Micro ETL Foundation (MEF) is to provide some simple and immediate solutions to the need of a
Data Warehouse. What could be more simple (and necessary) of a log message? It may seem strange to devote an entire article to
describe how to report the message "Hello world"; message, famous for all the programmers of the computer world. But we find
that what appears simple is actually a bit more complicated.
• If we take the first paragraph of any programming book, (eg. java) the solution is expressed with a single line:
System.out.println ( Hello Word! ");
• If we use another programming language (eg. PL / SQL Oracle), the solution is again expressed in a single line:
Dbms_output.put_line ('Hello World');
• What I want to emphasize is the incompleteness of the received message; In fact, the complexity lies not in the message itself: at
the end it is a simple function call. The complexity is the lack of context. Complexity is the metadata.
• Suppose that at this time are running more loading processes. The fact of receiving a warning or alert message is not enough.
What I want to know is when the message was sent, which job was running, how much time has elapsed since the previous
message, which was the current procedure at the time of the message.
3. Introduction (2)
• We find out that answer to all these questions is no longer trivial. When we go out from theoretical examples printed on the books
and enter into the reality of the daily work, everything becomes more complicated.
• We will see the theory and practice, then the code, which will give us an answer to contextual needs described above. And the
solution will be (surprise!), again expressed in a single line.
• The programming language used is the Oracle PL / SQL (simply because it is widely used in the Data Warehouse world and then I
turn to a wider audience), but the techniques exposed are easily playable on any other RDBMS. I forgot: we also see how to
ensure that the message is sent via e-mail.
• This, however, is not just for programmers. It 'an article that shows all the hard work behind a simple log message. Just imagine the
work that is behind the complete loading process of a Data Warehouse.
4. Definitions
Naming Convention
I spent a lot of words on the importance and need to always use a naming convention for the Data Warehouse projects. I summarize
some concepts. In general, the naming convention is the method by which you choose to assign a name to the various entities of the
system being designed. This method should produce a name that must be able to represent immediately the semantic nature of the
entities that we will use. It does not matter if they belong to a Data Warehouse system or an operational system.
ETL Process
The ETL process is the set of programs that load data from external systems into the Data Warehouse tables . Since this set can be very
complex and detailed, it is extremely important to have a messaging system that gives me the most information possible about the
process.
Job
A job is a unit of logic describe a very specific task, such as loading a dimension of analysis or of a fact table or both. A set of job-related
between them is in turn a job that is part of a schedule. Ie is activated at a preset time.
In turn, a job consists of simple and sequential processing units that we call unit.
Unit
The unit is the elementary processing unit (so it is code) which in turn can call other procedures and functions that can be defined in a
generic way, modules
Execution
A job usually runs at night, but it could also turn several times a day. Each execution of a job is a run that has to be identified with a
sequential number, which can be defined as exec counter.
For the time being be satisfied with these definitions very short. Later we will analyze in more detail
6. Requirements
• The requirements are very simple. To have a basic messaging system that signals everything that it is useful to monitor the
execution of the ETL process.
• It is a task of the designer to decide the content of these messages, which may simply be informative, as the number of rows
processed, or of particular attention, such as the identification of errors or anomalies. In these cases it should be possible to send
the message via e-mail.
• The system should not merely store all messages generated, but must also have the context information.
• The context is described in the paragraph relative to the definition of the ETL process. More context information we insert, and
more we can control the system and we will be more efficient and faster to resolve the problems.
• I suppose a minimum knowledge of Oracle and SQL and PL/SQL languages.
7. Design
• The design of the messaging component of the Micro ETL Foundation (MEF), consists of 3 tables, 2 sequences and 1 package.
• We can see them as the basic components. The minimum necessary for the other MEF components that we will use in the future.
• The naming convention used is a smaller version of the one to be applied to all objects of a Data Warehouse. [See
http://www.slideshare.net/jackbim/recipes-6-of-data-warehouse-naming-convention-techniques].
• But it would be unreasonable to apply to MEF the same logic that is used to organize the hundreds of tables typical of a large Data
Warehouse. Also we can use MEF in any type of project. We will use this simplified structure:
<entity name>=<area code>_ <type name>_<logical code>
• The basic package, being common to all, will be further simplified by eliminating the <logical name>.
• The area code is "MEF".
• The table MEF_CFT is the table of system configuration.
• The table MEF_EMAIL_CFT is the specific configuration table for e-mail addresses, the table MEF_MSG_LOT is one that will keep
the text messages.
• The sequence MEF_MSG_SEQ gives a sequence number to each message.
• The sequence MEF_RUN_SEQ is used to numerically identify each job execution.
• The Oracle package MEF is a library of programs that manages those objects.
• All scripts for creating objects, can be downloaded from slideshare that show in a pragmatic and fun way the system
[http://www.slideshare.net/jackbim/recipe-7-of-data-warehouse-a-messaging-system-for-oracle-dwh-2]. The scripts are very
minimal, with only the main structures, leaving the reader with the completion of all other accessory structures such as indexes,
constraints, etc..
• The tables will be created in the default tablespace, but you should always create ad-hoc tablespace.
8. The MEF_CFT table
• This table contains general information about the MEF and the Data Warehouse. Other columns will be added in the future (or on
your choice)
prj_cod: Code of the Data Warehouse project
user_cod: name of the Oracle user for ETL.
email_srv_cod: E-mail Server. Every corporate has always a server for managing e-mail, indicate here that server.
mef_root_txt: Path of the folder of MEF scripts.
mef_dir: Oracle directory pointing to mef_root_txt
9. The MEF_MSG_LOT table
• This table stores all log messages that are sent by loading process.
seq_num: Sequential number of the message. It is obtained from an Oracle sequence.
day_cod: Time stamp of the message insert in the format YYYYMMDD.
sched_cod: The identifier for the schedule to which the job belongs.
job_cod: job identifier. It is a logical entity in the sense that we think of it as the launch of a list of processing units.
unit_cod: Identifier of the processing unit within the job. We can think of it as a procedure or a function of the Oracle package.
module_cod: module Identifier. A unit, though complex, can in turn call the sub-routine or subfunctions, ie modules.
In this case, it is interesting to know yhid detail.
rows_num: Number of rows processed. Typically, this field is not set, but if we want to report the number of rows, for example,
inserted into a table, we also have this information.
line_txt: Message text
cline_txt: Message text in CLOB type.
ss_num: Number of seconds that have elapsed since the previous message. This information, together with the next two,
provides a summable data. If we wanted to know how long it took all the statements of a certain kind, we would be able to
calculate.
mi_num: Number of minutes that have elapsed since the previous message
hh_num: Number of hours that occurred since the previous message
elapsed_txt: Time elapsed since the previous message in the format HH24:MI:SS
stamp_dts: Time stamp insertion of the message.
exec_cnt: The identifier for the execution of the job. Every run of a job should be characterized by a number,
in turn, extracted from an Oracle sequence.
user_cod: Oracle user who posted the message. It is setted automatically by a session variable. It can be useful in cases
where multiple users contribute to the loading process(not recommended)
10. The MEF_EMAIL_CFT table
• This table configures the email addresses.
email_cod: Code to identify a group of recipients.
from_txt: Sender of the message. This name will appear as the sender of the e-mail message. Do not use special characters, nor
the blank between words. Eg. not set "Administrator ETL" but "Amministratore_ETL" otherwise you get a run-time error message
like:
ORA-29279: SMTP permanent error: 501
5.5.4 Invalid arguments
to_txt: Identifier of the message recipient
cc_txt: This e-mail address of the recipient in knowledge
subj_txt: Subject default message
status_cod: Status (1 = active, 0 = inactive) of the recipient
11. The MEF_MSG_SEQ sequence
• The sequence is an Oracle object. In practice it is a universal counter that increments each time you request it. Each message line
must have its sequential number. It is more functional than time stamp to sort the table.
• Because sometimes the messages are separated at a fraction of a second of each other, the time stamp might not be sufficiently
discriminating.
The MEF_RUN_SEQ sequence
• Sequence that indicates unambiguously, a run of a job.
12. The MEF package (1)
• This is the basic package. I suggest to develop all the code inside PL/SQL packages (they are basically libraries), which allow a better
management and use of the code. Now a short description of the units contained in the package.
• f_str: Utility function to generate a string after replacing the input variables.
• p_ins_msg_lot: This procedure perform the insertion of the message in the MEF_MSG_LOT table receiving a variable of row type
as input parameter. This procedure has the "pragma autonomous_transaction". It is very important and requires a thorough
description. It seems incredible, but if there were not, it would not be possible this messaging system. The pragma
autonomous_transaction allows us to commit (i.e. to validate into the database) only and exclusively the DML statements of the
unit which contains this compiler directive. This concept is crucial because it allows us to commit (so to insert data into the
database) without affecting the logic of the loading process. Let's clarify with an example. Typically, before to loading the daily data
into a table, you delete, first of all, the data of that day, and then load/reload the new data. (forget for a moment the partition
manipulation) You do the commit at the end, where the loading, (delete/insert), was successfully completed. If i ran a commit after
the delete and the insert has a problem, I could have the loss of data of the day. As the messaging needs to do a commit of the
message in the table, the execution of the innocent message "I have done the delete", would validate even the delete itself. And
this is a side effect not acceptable. The autonomous transaction solve the problem: the PL/SQL Oracle engine, produce a
"daughter" transaction who live an autonomous life, which validates the data in the MEF_MSG_LOT table without affecting the
logic of the parent transaction.
• p_rae: The management of the exception is standardized in the following manner. When any Oracle error happen, in the "when
others" instruction there is the call of the p_rae procedure that enriches the content of the error with other useful information,
such as, for example, where the exception occurred. The output will always be the standard error pv_error. The "when pv_error" is
used to make sure that you keep the original error.
• p_init: Other private procedure. It initializes the line_row variable to perform the insert in the table.
• delta_time: Procedure that, based on the input parameters, such as the date of the last message and the current date , calculates
all the delta-time information, how many seconds, minutes, hours have passed and a delta time in the 'HH24:MI:SS' format. The
ways in which we can calculate the delta time are numerous: one used is just one of many.
• f_get_seq_val: Function that extracts generalized, in a dynamic way, the next number of the Oracle sequence whose name is given
in the parameter.
• f_get_exec_cnt: Function that extracts the execution number of the job. Before calling the generic function f_get_seq_val, it verify
that has not already been set as a global variable: in this case, use the number of the current execution.
13. The MEF package (2)
• f_get_cft: Function that extracts the current configuration from the MEF_CFT table .
• p_esend: A function that performs the sending of email via the UTL_MAIL package.
• p_send: This is the procedure that sends the message.
• p_mail: Procedure for sending the e-mail. Using the email code in inputit looks for all the recipients in the MEF_EMAIL_CFT table
and calls the p_esend passing all required parameters.
14. System configuration (1)
• Before you can create all the structures described above, you must perform some environment check.
• First of all it is necessary to verify that the Oracle RDBMS is designed for sending emails.
• So you have to make sure that the Oracle user that sends the messages has all grant necessary for its operations. Let's see in detail.
SMTP check
• We need to verify that the Oracle RDBMS is designed for sending emails. In fact the sending is activated by calling a procedure that
is part of a package of the RDBMS. To verify this, connect to SQL * Plus with SYS user and check the package UTL_MAIL (from
Oracle 11):
Sqlplus / as sysdba
SQL> descr utl_mail
ERROR:
ORA-04043: object utl_mail does not exist
• If you get this error message, you need to install the package system UTL_MAIL and give the execute permission to the user.
• Always as user SYS, run the script to install the package from the rdbms/admin folder of the Oracle home, and give the execution
permissions to the ETL user.
• So you have to check the smtp server in the Oracle initialization parameter file; in the following example it is not setted and I
indicate you how to do it. Here is the sequence of instructions for making these checks (remember to replace the file path of the
utlmail.sql file with the relative path to your Oracle installation):
15. System configuration (2)
D:>sqlplus / as sysdba
SQL*Plus: Release 11.2.0.1.0 Production on Wed Jun 25 15:35:49 2014
Copyright (c) 1982, 2010, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
With the Partitioning option
SQL> @...RDBMSADMINutlmail.sql
SQL> @...RDBMSADMINprvtmail.plb
Package created.
Synonym created.
SQL> grant execute on UTL_MAIL to <ETL user>;
Grant succeeded.
SQL> sho parameters smtp
NAME TYPE VALUE
------------------------------------ ----------- -------
smtp_out_server string
SQL> alter system set smtp_out_server = <email server> scope=both;
System altered.
16. System configuration (3)
• Almost always, in the company there is a mail server, the value should be include the domain, for example. exch.dev.com.
• The option "scope = both" make the change effective immediately and permanently.
• We must also create and configure an ACL (from Oracle11) using the system package dbms_network_acl_admin. The ACL (Access
Control List) is only one way to define external resources to the RDBMS (such as email servers) and allow access to users.
• Now open slideshare [http://www.slideshare.net/jackbim/recipe-7-of-data-warehouse-a-messaging-system-for-oracle-dwh-2]
where you will find all the instructions on how to download and run the installation script of this messaging system. The script will
execute all the necessary settings for you.
• All this will take no longer than 5 minutes of work.
17. Test
• In addition to the tests showned in the slides of slideshare, we perform another test now, certainly more exhaustive, which clearly
shows the functionality simulating a piece of ETL load.
• We start by creating a test table, initializing it from a system table. At this point we run an anonymous block of code (anonymous
block is defined as the set of SQL statements included between a begin and an end) that will simulate a real load of a table with
insert and delete of data. The sequence of steps is quite simple: you initialize global variables package to better identify the various
steps in the messaging table.
create table SALES as select * from tabs;
begin
mef.pv_sched_cod := 'Daily';
mef.pv_job_cod := 'Staging tables';
mef.pv_unit_cod := 'Load sales table';
mef.pv_exec_cnt := 10;
mef.p_send('proc_prova','Load of SALES table');
mef.p_send('proc_prova','Deleting...');
delete from sales;
mef.p_send('proc_prova','Deleted');
mef.p_send('proc_prova','Loading...');
insert into sales select * from tabs;
mef.p_send('proc_prova','Loaded');
mef.p_mail('MEF','ETL_administrator','Sales table loaded');
mef.p_send('proc_prova','Load ended');
end;
/
• The final result obtained, which can be seen in the table MEF_MSG_LOT, it is very interesting, and gives you the wealth of
contextual information that is talked about in the beginning.
18. Conclusion
• We saw in detail the steps required to build a messaging system, simple, but very useful for all the people working in the Data
Warehouse projects.
• This implementation, which is the basis of my Micro-ETL-Foundation, obviously works on any Oracle-PL/SQL project, is non-
invasive, and can be applied at any time inserting simple procedure calls in an existing ETL process.