ETL

                                  March 2011
                                    Mgr. Jan Ulrych




All rights reserved Javlin 2011
Organizational Matters

      •     Introduction to ETL
      •     More about ETL
      •     CloverETL intro
      •     ETL Projects
      •     Current Trends




Presenter

      • Jan Ulrych
                › Graduated from Faculty of Mathematics and Physics
                  at Charles University, Prague
                › Has worked for Javlin a.s. as an ETL Consultant since 2008
                › E-mail: jan.ulrych@javlin.eu


      • Professional experience
                › DVRA project at DHL IT Services Europe, Prague
                › ETL Consultant on various data integration projects
                › Pre-sales consultant for CloverETL™ since 2010


Javlin Overview
      • Javlin – since 2005
      • Javlin is a software developer and services provider
                › CloverETL platform
                › Javlin services and ETL consulting
                › Software development for major clients

      • Employees: staff of 60+
                › Developers and consultants
                › Service, sales, support
                › Executive management

      • Experienced management team
                › Data and ETL software development legacy
                › Key industry expertise – finance, health, media, logistics, government

      • Office locations
                › Prague
                › Brno
                › Greater Washington DC


Selected Customers




Session 1:
 Origins & Motivation


Data Warehousing

      • “A data warehouse is a system that
        extracts, cleans, conforms, and delivers source data
        into a dimensional data store
        and then supports and implements querying
        and analysis for the purpose of decision making.”
            Source: Ralph Kimball, Joe Caserta: The Data Warehouse ETL Toolkit; Wiley 2004



      • The most visible part is
                    “querying and analysis”

      • The most complex and time consuming part is
           “extracts, cleans, conforms, and delivers”

Data Warehousing

      • The most complex and time consuming part is
           “extracts, cleans, conforms, and delivers”

      • How complex is it?
                › 70-80% of a BI (DI or DW) project's effort goes into a reliable ETL process




Getting data into DW

      • How to load data into DW?
                ›    Scripts in Linux shell, Perl, Python, …
                ›    sqlldr + SQL
                ›    Hardcoded in Java, C#, C
                ›    In-house built ETL tool
                ›    Off-the-shelf ETL tool

      • Aspects to be kept in mind
                ›    Manageability
                ›    Maintainability
                ›    Transparency
                ›    Scalability
                ›    Flexibility
                ›    Complexity
                ›    Auditing
                ›    Job restartability
                ›    Testing

Session 1:
 Introduction to ETL


ETL

      • How to load data into DW the right way?
                › Introduce a formal ETL process


      • This section covers
                ›    What ETL is
                ›    Motivation
                ›    Where to use ETL
                ›    How to Implement ETL
                ›    Key ETL Aspects



Motivation

      • Is ETL an interesting area?
                › 70-80% of a BI (DI or DW) project's effort goes into a reliable ETL process.


      • Let’s have a look at the DW & DI market size
                ›    In 2003, DI was a USD 9.3 billion market
                ›    In 2008, DI was a USD 13 billion market
                ›    By 2010, yearly growth estimated at USD 2.2 billion
                ›    TCO of DI can reach USD 509,600 annually


      • The more systems in the world,
        the more work in Data Integration!

What is ETL?

      • ETL = Extract – Transform – Load
      • Extract
                › Get the data from source system as efficiently as
                  possible
      • Transform
                › Perform calculations on data
      • Load
                › Load the data in the target storage




What is ETL?

      • ETL = Extract – Transform – Load
      • Extract
                › Get the data from source system as efficiently as possible

      • Clean
                › Perform data cleansing and dimension conforming

      • Transform
                › Perform calculations on data

      • Load
                › Load the data in the target storage
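
The four steps above can be sketched in a few lines of Python. This is only an illustration, not how any particular ETL tool works: the sample CSV data, the field names (customer, amount), the table name customer_total and the in-memory SQLite target are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would be a file or a database.
SOURCE_CSV = """customer,amount
alice,10.50
bob,not-a-number
alice,4.50
"""

def extract(text):
    """Extract: read raw records from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(records):
    """Clean: keep only records whose amount is a valid number."""
    valid = []
    for rec in records:
        try:
            valid.append({"customer": rec["customer"].strip(),
                          "amount": float(rec["amount"])})
        except ValueError:
            pass  # a real process would log or store the rejected record
    return valid

def transform(records):
    """Transform: compute the total amount per customer."""
    totals = {}
    for rec in records:
        totals[rec["customer"]] = totals.get(rec["customer"], 0.0) + rec["amount"]
    return totals

def load(totals, conn):
    """Load: write the results into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS customer_total (customer TEXT, total REAL)")
    conn.executemany("INSERT INTO customer_total VALUES (?, ?)", sorted(totals.items()))
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(clean(extract(SOURCE_CSV))), conn)
rows = conn.execute("SELECT customer, total FROM customer_total").fetchall()
```

Keeping each step in its own function makes it possible to test and restart the steps independently, which ties in with the restartability and testing aspects mentioned earlier.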


Why is ETL (System) Important?

      • Adds value to data
                › Removes mistakes and corrects data
                › Documented measures of confidence in data
                › Captures the flow of transactional data
                › Adjusts data from multiple sources to be used
                  together (conforming)
                › Structures data to be usable by BI tools
                › Enables subsequent business / analytical data
                  processing




ETL Disambiguation

      • ETL = Extract – Transform – Load
                › Not tied specifically to DW anymore

      • Process/System
                › A complete process including
                         •    Data extraction
                         •    Enforcing DQ and consistency standards
                         •    Conforming data from disparate systems
                         •    Delivering data to target
                         •    People, HW, Documentation, Support, etc.
      • Tool
                › A piece of software implementing the three (four) E-(C)-T-L steps
                › A tool designed specifically to perform data transformations

ETL Process

      • Layers in the diagram, from top to bottom
                › Presentation: Dashboards, Reports, Portals
                › Analytics: BI Tools (SAP, Cognos), KPI, Data Mining
                › ETL Data Integration: ETL
                › Data: DBMS (MS SQL, MySQL, Oracle), XML, flat files, CSV, mainframe

      • Annotations in the diagram
                › Integrated data: value for applications
                › Extracting value from the database




ETL Tool: True Data Integration



      • Source A: Files, Databases, Message Queues, Web Services
      • ETL: Read, Apply Logic, Write
      • Source B: Files, Databases, Message Queues, Web Services

      • True Data Integration is agnostic of source or target application
      • ETL is a bridge for bi-directional flow



ETL Data Integration Solutions (1)
                                  Data Migration
                                  Process of transferring data between storage types or
                                  formats. An automated migration frees up human resources
                                  from tedious tasks. Design, extraction, cleansing, load and
                                  verification are done for moderate to high complexity jobs.


                                  Data Consolidation
                                  Usually associated with moving data from remote locations
                                  to a central location or combining data due to an
                                  acquisition or merger.



                                  Data Integration
                                  Process of combining data residing at different sources and
                                  providing a unified view. Emerges in both commercial and
                                  scientific fields and is the focus of extensive theoretical work. Also
                                  referred to as Enterprise Information Integration.


ETL Data Integration Solutions (2)
                                  Master Data Management
                                  Processes and tools to define and manage non-transactional
                                  data. Provides for collecting, aggregating, matching,
                                  consolidating, quality-assuring, persisting and distributing
                                  data to an organization to ensure consistency and control.



                                  Data Warehouse
                                  Repository of electronically stored data. ETL facilitates
                                  populating, reporting and analysis. Includes business
                                  intelligence as well as metadata retrieval and management
                                  tools.


                                  Data Synchronization
                                  Process of making sure two or more locations contain the
                                  same up-to-date files. When a file is added, changed, or
                                  deleted in one location, synchronization mirrors the action
                                  at the other location.


Where is ETL used?




How to Implement ETL System (1)




      Source: Ralph Kimball, Joe Caserta: The Data Warehouse ETL Toolkit; Wiley 2004

How to Implement ETL System (2)

      •     Scripting (shell, Perl, Python)
      •     PL/SQL, sqlldr
      •     Transformation hardcoded in Java, C#
      •     Develop (universal) ETL tool in-house
      •     Using off-the-shelf ETL tool




ETL Tool Key Features (1)

      • Extract, Load => flexible on interfaces
                ›    Flat files, DBMS, XML data, XLS,
                ›    MQ, web services, LDAP
                ›    Semi-structured data (emails, web logs, wiki pages)
                ›    Unstructured data (blogs, documents)
                ›    Extensibility with custom connectors
                ›    Local data, remote data FTP(S), SFTP, SCP, http(s)
      • Clean
                › Lookups, Validations, Filters, Translations
      • Transform
                › Changing data structure, Joins, (De)Normalization,
                  Aggregation, RollUp, Sorting, Partitioning,
                  Data De-duplication
                › Ability to call external tools
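
As a rough illustration of the Clean features listed above (validation, filter, lookup/translation) together with de-duplication, here is a minimal Python sketch. The lookup table, the field names and the validation rule are hypothetical and chosen only for the example.

```python
# Hypothetical lookup table translating country codes to names.
COUNTRY_LOOKUP = {"CZ": "Czech Republic", "US": "United States"}

records = [
    {"id": 1, "country": "CZ", "amount": 100},
    {"id": 2, "country": "XX", "amount": 50},    # unknown code, fails the lookup
    {"id": 1, "country": "CZ", "amount": 100},   # duplicate of id 1
    {"id": 3, "country": "US", "amount": -5},    # fails validation
]

def is_valid(rec):
    """Validation: the amount must not be negative."""
    return rec["amount"] >= 0

def clean(recs):
    """Apply validation, lookup/translation and de-duplication by id."""
    seen = set()
    accepted, rejected = [], []
    for rec in recs:
        if not is_valid(rec) or rec["country"] not in COUNTRY_LOOKUP:
            rejected.append(rec)
            continue
        if rec["id"] in seen:
            continue  # de-duplication: keep the first occurrence only
        seen.add(rec["id"])
        accepted.append({**rec, "country_name": COUNTRY_LOOKUP[rec["country"]]})
    return accepted, rejected

accepted, rejected = clean(records)
```

Rejected records are collected rather than silently dropped, so they can be reported back to the owner of the source system.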


ETL Tool Key Features (2)

      • Performance
                › Symmetric Multiprocessing (SMP)
                         • Pipeline processing
                         • Multithreaded processing
                › Massively Parallel Processing (MPP)
                         • Clustering
                         • MapReduce
                › Load balancing
      • User friendliness
                › GUI
                › Metadata capture
                › Training time
      • Development
                › Reusable components
                › Impact Analysis / Data Lineage
                › Documentation
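
Pipeline processing, listed under Performance above, means that the read, transform and write stages run concurrently and pass records to each other. The sketch below shows only the structure, using Python threads and bounded queues. The stage functions and the doubling "transformation" are made up for the example, and in CPython threads do not give true CPU parallelism for pure-Python work.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the record stream

def reader(out_q):
    """Stage 1: produce records and pass them downstream."""
    for i in range(5):
        out_q.put({"value": i})
    out_q.put(SENTINEL)

def transformer(in_q, out_q):
    """Stage 2: transform each record as soon as it arrives."""
    while True:
        rec = in_q.get()
        if rec is SENTINEL:
            out_q.put(SENTINEL)
            break
        out_q.put({"value": rec["value"] * 2})

def writer(in_q, target):
    """Stage 3: write transformed records to the target list."""
    while True:
        rec = in_q.get()
        if rec is SENTINEL:
            break
        target.append(rec["value"])

# Bounded queues keep a fast stage from running far ahead of a slow one.
q1, q2 = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
result = []
threads = [
    threading.Thread(target=reader, args=(q1,)),
    threading.Thread(target=transformer, args=(q1, q2)),
    threading.Thread(target=writer, args=(q2, result)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because each stage is a single thread reading from a FIFO queue, record order is preserved from reader to writer.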

ETL Tool Key Features (3)

      • Manageability
                ›    Team collaboration
                ›    Transformation repository
                ›    Metadata repository
                ›    Development process (Dev -> Test -> Prod)
                ›    Security
      • Runtime
                › Scheduler Automation
                › Recovery and Restart
                › Workflow
      • Others
                › Vendor stability
                › Release cycle
                › Support

ETL Market




      Source: Ted Friedman, Mark A. Beyer, Eric Thoo: Magic Quadrant for Data Integration Tools; Gartner RAS Core Research Note G00207435; 19 November 2010



Well Known ETL Tools

      • Commercial
                ›    Ab Initio
                ›    IBM DataStage
                ›    Informatica PowerCenter
                ›    Microsoft SQL Server Integration Services
                ›    Oracle Data Integrator
                ›    SAP Business Objects – Data Integrator
                ›    SAS Data Integration Studio
      • Open-source based
                ›    Adeptia Integration Suite
                ›    Apatar
                ›    CloverETL
                ›    Pentaho Data Integration (Kettle)
                ›    Talend Open Studio/Integration Suite
                                                 The list above is not meant to be comprehensive


ETL or ELT?

      • ETL = Extract – Transform – Load
                › Much more flexible
      • ELT = Extract – Load – Transform
                › Pushed forward (mainly) by Oracle
                         • First get the data into database
                         • Then use Oracle DB tools to work with it
                › Less flexible
                         • Tightly coupled to vendor’s database/solution
                         • Less flexible on output formats
                         • Requires staging area
                › Possibly better performance
                         • All data are processed in the same database
                         • Nothing is downloaded from database for Transform step
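
The difference between the two approaches can be sketched with Python and an in-memory SQLite database standing in for the target. The table names (staging, totals_etl, totals_elt) and the sample rows are assumptions for the example. The ELT variant shows why a staging area is needed and why the transformation logic ends up written in the database's own SQL dialect.

```python
import sqlite3

# Hypothetical raw source rows: (name, amount as text).
raw = [("alice", "10"), ("bob", "32"), ("alice", "5")]

def etl(conn):
    """ETL: transform outside the database, then load only the result."""
    totals = {}
    for name, amount in raw:
        totals[name] = totals.get(name, 0) + int(amount)
    conn.execute("CREATE TABLE totals_etl (name TEXT, total INTEGER)")
    conn.executemany("INSERT INTO totals_etl VALUES (?, ?)", list(totals.items()))

def elt(conn):
    """ELT: load raw data into a staging table, then transform with SQL."""
    conn.execute("CREATE TABLE staging (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", raw)
    conn.execute(
        "CREATE TABLE totals_elt AS "
        "SELECT name, SUM(CAST(amount AS INTEGER)) AS total "
        "FROM staging GROUP BY name"
    )

conn = sqlite3.connect(":memory:")
etl(conn)
elt(conn)
etl_rows = conn.execute("SELECT name, total FROM totals_etl ORDER BY name").fetchall()
elt_rows = conn.execute("SELECT name, total FROM totals_elt ORDER BY name").fetchall()
```

Both variants produce the same totals. What differs is where the work happens and which parts would have to change if the target database were replaced.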

Session 2:
 Dominate Your Data with
 CloverETL™


What is CloverETL
      • A data integration software platform
                › Manages, designs and runs your data
                › Embeddable and scalable
                › Integrates easily with databases, operating
                  systems and applications

      • Clover works on all OS
                › Linux
                › Windows
                › HP-UX
                › AIX
                › IBM AS/400
                › Solaris
                › Mac OS X

      • ETL platform that dominates your data
            Extract – Transform – Load
             › Reads from one or more data sources
             › Transforms data in almost any way imaginable
             › Writes to any number of data targets

      • Legacy of open source and commitment to commercial use

      • CloverETL Engine can be used as embedded OEM as well

                                  www.cloveretl.com
CloverETL Key Features
      • Platform independent
                › Java, integration, library support
      • Scalability
                › Desktop → Enterprise → Cluster
      • Usability
                › Built by ETL experts for ETL experts
      • Performance
                › Outstanding at all production levels
      • Services
                › Clover was built to require minimal services
      • OEM Embeddable
                › Small footprint, extensible and customizable
      • Cost
                › Delivers lowest total cost of ownership

CloverETL Product Suite

      • CloverETL Server
                › Production Platform
                › Web apps (Tomcat, WebSphere, GlassFish)
                › For Enterprise Integration

      • CloverETL Designer
                › Visual Designer
                › Transformation Developer
                › Eclipse Platform

      • CloverETL Engine
                › Pure Java 6.0
                › Embeddable
                › Extensible
                › Ideal for OEM

      • Diagram labels: Design, Manage, Runtime


CloverETL Designer




                                                       Features
                                                       •   Transformation design
                                                       •   Intuitive GUI
                                                       •   Drag & drop
                                                       •   Components Library
                                                       •   Debug

CloverETL Vision

        Vision
        CloverETL enables companies to get more value from
        their data quickly without massive infrastructure expense
        or years of project investment.
        Dominate your data

        Approach
        CloverETL is the best value for money. It builds off the
        open source foundation of the CloverETL Engine and
        scales from desktop to enterprise to cluster.
        Get it done quickly

        Investment
        CloverETL is an easy to use ETL software that can grow by
        both core features and user expertise… at a fraction of
        the cost of larger system vendors.
        Low cost buy-in for our customers

CloverETL Approach

      • Components
               › Prebuilt algorithms

      • Graph
               › Processing algorithm
                 in visual form

      • Data Flow
               › Edges between components

      • Process
               › Build > Connect > Configure
               › Data processed as structured
                 records with named fields
               › Components operate on
                 record fields

      • Example graph in the diagram: Input → Transform → Output
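
The idea of components connected by edges and operating on records with named fields can be imitated with Python generators. This is only an analogy: the function names below are invented for the example and are not the CloverETL API.

```python
# Illustrative only: these functions are NOT the CloverETL API.

def reader(rows):
    """Input component: emits records as dicts with named fields."""
    for row in rows:
        yield dict(row)

def reformat(records):
    """Transform component: operates on individual record fields."""
    for rec in records:
        yield {"name": rec["name"].upper(), "total": rec["price"] * rec["qty"]}

def writer(records, target):
    """Output component: collects records into the target list."""
    for rec in records:
        target.append(rec)

source = [
    {"name": "pen", "price": 2, "qty": 3},
    {"name": "ink", "price": 5, "qty": 1},
]
output = []

# Build > Connect > Configure: chaining the generators plays the role
# of the edges that carry records from one component to the next.
writer(reformat(reader(source)), output)
```

Each record flows through the whole chain before the next one is read, so the data set never has to be held in memory as a whole.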



Transformation capabilities

      •       50+ specialized components available for use
               › Readers and writers
                         •    Text or binary files, CSV, XML
                         •    Archives including ZIP or GZIP
                         •    Remote transfers over HTTP/FTP(S)
                         •    Access to messaging via JMS

                › Database connectivity
                         • JDBC connectivity, support for bulk loaders
                         • Supports any SQL statements, stored procedure calls

                › Transformers and aggregators
                         • Variety of data manipulation algorithms: sorting, deduping, joining,
                           arithmetics, aggregating, statistic functions and more
                         • Customizable with user-defined code
                         • Alternative implementations for efficient execution

Physical architecture

      • Others: Server-centric with thin clients
                › Server is necessary for development and execution
                › Transformations and data are stored remotely on server
                › Limited use when working on multiple sites or in restricted-access
                  networks


      •       CloverETL: Designer is a standalone application
                › Development and execution is possible without central server
                › Transformations and metadata are stored on local machine
                › Transformations can be also deployed to server for centralized
                  management
                › Designer available for all major platforms (Linux, Windows, Mac)




Repository

      •       Others: Central repository
                › With proprietary storage format (binary files, tables)
                › Often managed with a proprietary version control system
                › Can be problematic with mass changes or when “hacking” is necessary

      •       CloverETL: XML format files and directories
                › No proprietary repository, uses plain files and directories
                › Transformation and metadata stored in human-readable XML
                › Open to any version management systems
                    • CVS, SVN, IBM Rational ClearCase, git
                › Integrated with Eclipse VMS clients
                    • Subclipse, EGit, ClearCase




Flexibility and expressive power

      •       Others: Expression-based languages only
                › Expressions and built-in functions for simple data manipulation
                › No support for programming statements (loops, user functions)
                › Limited when data manipulation requires complex coding

      •       CloverETL: Scripting and more
                › Components are customizable with CTL or Java code
                › CTL: scripting language with simple syntax
                    • Allows simple expressions as well as complex code
                    • Has variables, loops, user-defined functions
                    • Many built-in data validation functions
                › Java: mature programming language, allows arbitrary coding
                    • Access to variety of existing libraries
                    • GC-based memory management = rapid development



Extensibility

      •       Others: Custom components
                › Limited support for developing custom components
                › Impossible to extend other aspects of data processing


      •       CloverETL: Plugin-based extensible platform
                ›    Ready to extend, customize and modify
                ›    Plugin-based architecture for easy extension management
                ›    Supports custom components, connections, functions
                ›    Implemented in Java
                       • Access to libraries
                       • Memory management
                       • Developers




Session 2:
 CloverETL™ Designer Examples


CloverETL Server




                                                     Features
                                                     •   Automation and scheduling
                                                     •   File and event triggers
                                                     •   Workflows
                                                     •   Monitoring and logging
                                                     •   User management
                                                     •   Real-time ETL
                                                     •   Clustering
                                                     •   Load balancing
                                                     •   Failover
                                                     •   Distributed processing
                                                     •   Inexpensive buy-in


Key features

      • Runtime automation
                › Allows integration with existing infrastructure
                › Simplified management and execution of data transformations

      • Scalability and optimization
                › Increases transformation performance
                › Shortens response time in continuous/transactional processing

      • Security
                › Controls access to data, transformations and server configuration
                › Secures communication between server and clients

      • Clustering
                › Allows cooperation between multiple processing nodes
                › Improves scalability, performance and error resiliency


Runtime automation

      • Internal scheduler
                › Supports one-time or periodic execution
                › Allows interval-based scheduling, including flexible cron-like rules

      • Integration with enterprise schedulers
                › Transformation execution can be started by external scheduler
                › cron, IBM Tivoli (Maestro), Autosys, UC4
                › Scheduler instructs CloverETL Server via one of its interfaces (HTTP, JMX)

      • Events, tasks and dependencies
                ›    Tasks and transformations are started on internal or external events
                ›    Internal: transformation finished or failed
                ›    External: file arrived or its size changed
                ›    Allows creating dependencies between executed tasks
                ›    Suitable for logical sequencing of execution or monitoring purposes


Runtime automation (cont.)

      • Monitoring
                › Supports alert emails or messages with configurable content
                › Automatically populated with execution status, log, statistics
                › Allows integration with ticketing systems, support team

      • Execution history
                › Automatically stores performance and statistics about each execution
                › Stored in database tables and log files
                › Open for any trend analysis and reporting

      • Archiving
                › Configurable cleanup of execution logs and history
                › Can be further extended by scripting




Scalability and optimization

      • Parallel execution
                › Executes multiple transformations in parallel
                › Can execute multiple instances of a single transformation (SOA)

      • Graph pooling
                › Improves response time
                › Useful with SOA architectures and Launch Services

      • Launch services
                › Applications implemented as data transformations and deployed as
                  RESTful web services
                › Schema on next slide




CloverETL Cluster
               Increased processing and throughput
                    - Parallel execution of transformations over multiple servers or nodes
                    - Load balancing based on individual node utilization

               Increased fault tolerance
                    - Fail over in case of problems with particular nodes
                    - Data replication

               Increased flexibility
                   - Cluster can be dynamically reconfigured by adding or removing nodes




       Large quantity of data loaded → Processed in parallel in Cluster → Written out in parallel

Session 3:
 Data Integration Projects


Data Integration Projects

      • This section covers
                ›    How to manage DI projects
                ›    Phases of a DI project
                ›    Responsibilities
                ›    Typical issues




Data Integration Projects

      • Phases of a typical Data Integration Project
                ›    Requirements
                ›    Planning
                ›    Analysis (of ETL steps)
                ›    Implementation
                ›    Documentation
                ›    RTP
                ›    Support




Requirements

      • Functional
                ›    Input data
                ›    Output data
                ›    Output data format
                ›    Transformation logic
      • Non-functional
                ›    Time restrictions
                ›    Availability
                ›    Frequency of update
                ›    Data Latency
                ›    How to handle erroneous records
                ›    Security requirements

Planning

      • ETL system implementation must be planned
        properly
                ›    Time for implementation
                ›    Correctly prioritized
                ›    Thorough data analysis extremely important
                ›    Unforeseen data quality issues cause delays
                ›    Biggest risk is unexpected data quality issues
                ›    Communicate properly


      • Keep it simple
                › Do not try to save the world
                › “If you think it can be done simply, do it simply”

Analysis – Extract

      • In which source systems is the data we need?
      • How can we access the system?
                ›    Flat files / database access / XML / web service / JMS
                ›    Full extract / incremental / change notification / on-request
                ›    Local access / FTP(S) / SFTP / SCP / MQ
      • Data Syntax
                ›    What is a record?
                ›    What is the record delimiter?
                ›    What is the field delimiter?
                ›    What is the data length / data type / format?
      • Data Semantics
                ›    What are the field names?
                ›    What data does the field represent?
                ›    Ok, what data does the field really represent?
                ›    Are there any duplicates?
      • What are the limitations / restrictions?
                ›    What are the data volumes?
                ›    How often can the export be done?
                ›    Any impact on network (NIA)?
      • What is the expected data growth rate?
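
The choice between a full and an incremental extract, raised in the questions above, can be sketched as follows. The orders table, its updated_at column and the "watermark" bookkeeping are assumptions for the example. A real source system may offer change notifications or other mechanisms instead.

```python
import sqlite3

# Hypothetical source table with a last-modified column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    (1, "2011-03-01"), (2, "2011-03-05"), (3, "2011-03-09"),
])

def full_extract(conn):
    """Full extract: read every record on every run."""
    return conn.execute("SELECT id, updated_at FROM orders ORDER BY id").fetchall()

def incremental_extract(conn, watermark):
    """Incremental extract: read only records changed after the watermark.

    Returns the rows and the new watermark to store for the next run.
    ISO-formatted dates compare correctly as plain strings.
    """
    rows = conn.execute(
        "SELECT id, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][1] if rows else watermark
    return rows, new_watermark

all_rows = full_extract(conn)
rows, watermark = incremental_extract(conn, "2011-03-01")
```

The incremental variant reads less data per run, but it depends on the source reliably maintaining the modification timestamp.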


Analysis – Clean & Transform

      • Which fields need to be validated?
                › … against which source?
                › How to handle erroneous data?

      • What is the data flow?
                › Between source and target data
                › What is the transformation logic?
                › Action on error?
                         • Stop transformation
                         • Process valid data only

      • Transformation restartability
                › Small data volumes – transaction based
                › Huge data volumes – process in bulk mode
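
The two "action on error" options above (stop the transformation, or process valid data only) can be expressed as a single switch. A minimal sketch, in which the on_error parameter, the exception class and the record layout are all hypothetical:

```python
class TransformationError(Exception):
    """Raised when the transformation is configured to stop on bad data."""

def parse_amount(rec):
    """Convert the amount field to a number; raises ValueError on bad input."""
    return {"id": rec["id"], "amount": float(rec["amount"])}

def run(records, on_error="skip"):
    """Process records; on bad data either stop or set the record aside."""
    good, rejects = [], []
    for rec in records:
        try:
            good.append(parse_amount(rec))
        except ValueError as exc:
            if on_error == "stop":
                raise TransformationError(f"record {rec['id']}: {exc}") from exc
            rejects.append(rec)  # process valid data only, keep the rejects
    return good, rejects

data = [
    {"id": 1, "amount": "3.5"},
    {"id": 2, "amount": "n/a"},
    {"id": 3, "amount": "7"},
]
good, rejects = run(data, on_error="skip")
```

Calling run(data, on_error="stop") on the same input raises TransformationError at record 2 instead of returning a partial result.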

Analysis – Load

      • What is the target schema?
      • What is the target data volume?
      • What are the history requirements?
                › Usually SCD type 1, 2 or 3
      • Data Syntax
                ›    What is a record?
                ›    What is the record delimiter?
                ›    What is the field delimiter?
                ›    What is the data length / data type / format?
      • Data Semantics
                › What are the field names?
                › What data does the field represent?
                › Ok, what data does the field really represent?
      • What are the limitations / restrictions?
                › What are the data volumes?
                › How often can the export be done?
                › Any impact on network (NIA)?
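
The history requirement above determines how a changed dimension record is loaded: SCD type 1 overwrites the old value, type 2 adds a new row and keeps the old one, and type 3 keeps the previous value in an extra column. The following sketch illustrates type 2 only. The in-memory list, the column names (`valid_from`, `valid_to`) and the `apply_scd2` helper are assumptions for illustration; a real target would be a database table.

```python
from datetime import date


def apply_scd2(dimension, natural_key, new_attrs, load_date):
    """Expire the current row for natural_key if its attributes changed,
    then append a new current version (SCD type 2)."""
    current = next(
        (row for row in dimension
         if row["natural_key"] == natural_key and row["valid_to"] is None),
        None,
    )
    if current is not None and current["attrs"] == new_attrs:
        return  # nothing changed, keep the current version
    if current is not None:
        current["valid_to"] = load_date  # close the old version
    dimension.append({
        "surrogate_key": len(dimension) + 1,  # simplistic key generation for this sketch
        "natural_key": natural_key,
        "attrs": dict(new_attrs),
        "valid_from": load_date,
        "valid_to": None,  # None marks the current version
    })
```

A usage example: loading the same customer twice with an unchanged city leaves one row, while a changed city closes the first row and adds a second.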



Implementation

      • Development
                › Enforce standardization
                         •    Naming conventions
                         •    Best practices
                         •    Generating surrogate keys
                         •    Looking up keys
                         •    Applying default values
      • Testing
                › Review
                › Testing on Production data
                › Unit Testing
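
Three of the standardized tasks above (generating surrogate keys, looking up keys, applying default values) are small enough to sketch. This is a minimal illustration, assuming an in-memory lookup and made-up default values; the `KeyLookup` class, `apply_defaults` function and the field names are hypothetical.

```python
import itertools


class KeyLookup:
    """Maps natural (source) keys to surrogate keys, generating a new
    surrogate key the first time a natural key is seen."""

    def __init__(self, start=1):
        self._next = itertools.count(start)
        self._by_natural = {}

    def get_or_create(self, natural_key):
        if natural_key not in self._by_natural:
            self._by_natural[natural_key] = next(self._next)
        return self._by_natural[natural_key]


# Hypothetical defaults; in a project these would come from the agreed standards.
DEFAULTS = {"country": "UNKNOWN", "segment": "N/A"}


def apply_defaults(record, defaults=DEFAULTS):
    """Return a copy of record with missing or empty fields set to defaults."""
    out = dict(record)
    for field, default in defaults.items():
        if out.get(field) in (None, ""):
            out[field] = default
    return out
```

Implementing such helpers once and reusing them in every transformation is one way to enforce the standardization the slide asks for, and small functions like these are also easy to cover with unit tests.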


Implementation

      • Documentation
                › Data sources / targets / transformations
                › Data Lineage
                         • Important to know and publish
                › Frequency of ETL process runs
                › Error handling
                › Support – Monitoring checklist


      • RTP



Planning & Leadership

      • Contact person for each source system
      • Contact business person
      • ETL team responsibilities
                › Define ETL scope
                › Perform source system data analysis
                › Define data quality strategy
                › Gather & document business rules from business
                  users
                › ETL Implementation
                › Defining & executing Unit & QA testing
                › Implementing production

ETL team Roles & Responsibilities

      •     ETL Manager
      •     ETL Architect
      •     ETL Developer
      •     System Analyst
      •     Data-Quality Specialist
      •     Database Administrator (DBA)
      •     Dimension Manager
      •     Fact Table Provider



Typical Issues

      •     Project Management
      •     Poor Data Analysis
      •     Data Understanding
      •     Performance
      •     Scalability




Typical Issues – Project Management

      • ETL takes 70-80% of project resources
                › Be aware of this from the beginning
                › Communicate this to stakeholders
                › Plan ETL phase properly


      • Data Quality
                › Unexpected data quality issues are the biggest risk
                › DQ issues cause delays
                › DQ issues generate extra work




Typical Issues – Data Understanding

      • Source system (data)
                ›    Not documented
                ›    Documented incorrectly
                ›    Represents something other than it should
                ›    Data not clean


      • Transformation/Requirements
                ›    Not specified properly
                ›    Not specified at all
                ›    Initial analysis has not revealed issues/complexity
                ›    Requirements being changed
Typical Issues – Performance & Scalability

      • Performance
                › Is the performance ok now?
                › Will performance be ok in 5 years?


      • Scalability
                › What is the data growth rate?
                › Are we testing on production data volumes?

      • Change Data Capture
                › Time consuming task
                › Issue on old systems
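
When an old source system offers neither change notifications nor reliable modification timestamps, one common fallback for change data capture is to compare the current full extract with the previous one. The sketch below illustrates that idea only; the `id` key column and the `row_hash` and `detect_changes` helpers are assumptions, and the approach is named here as one possible technique, not as the method used in any specific project.

```python
import hashlib


def row_hash(row):
    """Stable fingerprint of a row's values, independent of dict ordering."""
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def detect_changes(previous, current, key="id"):
    """Compare two full extracts and return (inserted, updated, deleted) keys."""
    prev = {r[key]: row_hash(r) for r in previous}
    curr = {r[key]: row_hash(r) for r in current}
    inserted = sorted(curr.keys() - prev.keys())
    deleted = sorted(prev.keys() - curr.keys())
    updated = sorted(k for k in curr.keys() & prev.keys() if curr[k] != prev[k])
    return inserted, updated, deleted
```

The cost of this approach grows with the full data volume on every run, which is one reason change data capture is a time consuming task and worth testing on production-sized extracts.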


Session 4:
 Current Trends


Market Trends

      • Shift to Semi-structured & Unstructured data
                › Emails, documents, blogs, …

      • Real-time processing
                › CRM, Zero-latency business
                › SOA, Web Services, ESB, JMS, MQ

      • Cloud-mania
                › Cloud, Cluster, Elastic Cluster
                › MapReduce, Apache Hadoop

      • Reducing TCO
                › License costs, Development cost, Maintenance
                › Emphasis on value, not price

      • Services for small customers
                › Require better ROI


Literature
      Ralph Kimball, Joe Caserta:
              The Data Warehouse ETL Toolkit; Wiley

      Ralph Kimball, Margy Ross:
              The Data Warehouse Toolkit; Wiley

      Len Silverston:
              The Data Model Resource Book; Wiley




Contact Javlin


       Web
       www.cloveretl.com



       US                                     Europe
       Javlin Inc.                            Javlin a.s.
       8000 Towers Crescent Drive             Křemencova 18
       Suite 1350                             110 00 Praha 1
       Vienna, VA 22182                       Czech Republic
       USA

       Web: www.javlininc.com                 Web: www.javlin.eu
       Email: info@javlininc.com              E-mail: info@javlin.eu
       Phone: +1 703 847 3600                 Phone: +420 277 003 200


Introduction to ETL and Data Integration

  • 1. ETL March 2011 Mgr. Jan Ulrych All rights reserved Javlin 2011
  • 2. Organizational Matters • Introduction to ETL • More about ETL • CloverETL intro • ETL Projects • Current Trends All rights reserved Javlin 2011
  • 3. Presenter • Jan Ulrych › Graduated from Faculty of Mathematics and Physics at Charles University, Prague › Works for Javlin a.s. as ETL Consultant since 2008 › E-mail: jan.ulrych@javlin.eu • Professional experience › DVRA project at DHL IT Services Europe, Prague › ETL Consultant on various data integration projects › Since 2010 is a Pre-sales consultant for CloverETL™ All rights reserved Javlin 2011
  • 4. Javlin Overview • Javlin – since 2005 • Javlin is a software developer and services provider › CloverETL platform › Javlin services and ETL consulting › Software development for major clients • Employees: staff of 60+ › Developers and consultants › Service, sales, support › Executive management • Experienced management team › Data and ETL software development legacy › Key industry expertise – finance, health, media, logistics, government • Office locations › Prague › Brno › Greater Washington DC All rights reserved Javlin 2011
  • 5. Selected Customers All rights reserved Javlin 2011
  • 6. Session 1: Origins & Motivation All rights reserved Javlin 2011
  • 7. Data Warehousing • “A data warehouse is a system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making.” Source: Ralph Kimball, Joe Caserta: The Data Warehouse ETL Toolkit; Wiley 2004 • The most visible part is “querying and analysis” • The most complex and time consuming part is “extracts, cleans, conforms, and delivers” All rights reserved Javlin 2011
  • 8. Data Warehousing • The most complex and time consuming part is “extracts, cleans, conforms, and delivers” • How complex is it? › 70-80% of BI (DI or DW) project is reliable ETL process All rights reserved Javlin 2011
  • 9. Getting data into DW • How to load data into DW? › Scripts in linux shell, perl, python, … › sqlldr + SQL › Hardcoded in Java, C#, C › In-house built ETL tool › Off-the shelf ETL tool • Aspects to be kept in mind › Manageability › Maintainability › Transparency › Scalability › Flexibility › Complexity › Auditing › Job restartability › Testing All rights reserved Javlin 2011
  • 10. Session 1: Introduction to ETL All rights reserved Javlin 2011
  • 11. ETL • How to load data into DW in a right way? › Introduce formal ETL process • This section covers › What ETL is › Motivation › Where to use ETL › How to Implement ETL › Key ETL Aspects All rights reserved Javlin 2011
  • 12. Motivation • Is ETL interesting area? › 70-80% of BI (DI or DW) project is reliable ETL process. • Let’s have a look on the DW & DI market size › In 2003, DI was USD 9.3 billion market › In 2008, DI was USD 13 billion market › By 2010, yearly grow estimated to USD 2.2 billion › TCO of DI can reach USD 509,600 annually • The more systems in the world, the more work in Data Integration! All rights reserved Javlin 2011
  • 13. What is ETL? • ETL = Extract – Transform – Load • Extract › Get the data from source system as efficiently as possible • Transform › Perform calculations on data • Load › Load the data in the target storage
  • 14. What is ETL? • ETL = Extract – Transform – Load • Extract › Get the data from source system as efficiently as possible • Clean › Perform data cleansing and dimension conforming • Transform › Perform calculations on data • Load › Load the data in the target storage
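The four steps on this slide can be sketched as four small functions chained together. This is only an illustrative sketch: the sample data, the field names (`country`, `amount`) and the in-memory "warehouse" dictionary are assumptions made for the example, not part of the presentation.

```python
import csv
import io


def extract(text):
    """Extract: read raw records from a CSV source (an in-memory string here)."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean: drop records without an amount and conform the country code."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # erroneous record: missing amount
        cleaned.append({**row, "country": row["country"].strip().upper()})
    return cleaned


def transform(rows):
    """Transform: calculate the total amount per country."""
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0.0) + float(row["amount"])
    return totals


def load(totals, target):
    """Load: write the result into the target store (a plain dict here)."""
    target.update(totals)
    return target


# Hypothetical source data with inconsistent codes and one incomplete record.
SOURCE = "country,amount\ncz,10.5\nCZ ,4.5\nus,\nus,7\n"
warehouse = load(transform(clean(extract(SOURCE))), {})
print(warehouse)
```

Keeping each step a separate function mirrors the E-C-T-L split: any step can be replaced (for example, a database reader instead of CSV) without touching the others.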
  • 15. Why is ETL (System) Important? • Adds value to data › Removes mistakes and corrects data › Provides documented measures of confidence in data › Captures the flow of transactional data › Adjusts data from multiple sources to be used together (conforming) › Structures data to be usable by BI tools › Enables subsequent business / analytical data processing
  • 16. ETL Disambiguation • ETL = Extract – Transform – Load › Not tied specifically to DW anymore • Process/System › A complete process including • Data extraction • Enforcing DQ and consistency standards • Conforming data from disparate systems • Delivering data to target • People, HW, Documentation, Support, etc. • Tool › A piece of software implementing the three (four) E-(C)-T-L steps › A tool designed specifically to perform data transformations
  • 17. ETL Process (layer diagram, top to bottom) • Presentation › Dashboards, Reports, Portals • Analytics › BI Tools (SAP, Cognos), KPI, Data Mining • ETL › Integrated data value for applications • Data Integration • ETL › Extracting value from the database • Data › DBMS (MS SQL, MySQL, Oracle), XML, flat files, CSV, mainframe
  • 18. ETL Tool: True Data Integration • Source A › Files, Databases, Message Queues, Web Services • ETL › Read › Apply Logic › Write • Source B › Files, Databases, Message Queues, Web Services • True Data Integration is agnostic of source or target application • ETL is a bridge for bi-directional flow
  • 19. ETL Data Integration Solutions (1) • Data Migration › Process of transferring data between storage types or formats. An automated migration frees up human resources from tedious tasks. Design, extraction, cleansing, load and verification are done for moderate to high complexity jobs. • Data Consolidation › Usually associated with moving data from remote locations to a central location or combining data due to an acquisition or merger. • Data Integration › Process of combining data residing at different sources and providing a unified view. Emerges in both commercial and scientific fields and is the focus of extensive theoretical work. Also referred to as Enterprise Information Integration.
  • 20. ETL Data Integration Solutions (2) • Master Data Management › Processes and tools to define and manage non-transactional data. Provides for collecting, aggregating, matching, consolidating, quality-assuring, persisting and distributing data to an organization to ensure consistency and control. • Data Warehouse › Repository of electronically stored data. ETL facilitates populating, reporting and analysis. Includes business intelligence as well as metadata retrieval and management tools. • Data Synchronization › Process of making sure two or more locations contain the same up-to-date files. When a file is added, changed, or deleted at one location, synchronization mirrors the action at the other location.
  • 21. Where is ETL used?
  • 22. How to Implement ETL System (1) Source: Ralph Kimball, Joe Caserta: The Data Warehouse ETL Toolkit; Wiley 2004
  • 23. How to Implement ETL System (2) • Scripting (shell, perl, python) • PL/SQL, sqlldr • Transformation hardcoded in Java, C# • Develop (universal) ETL tool in-house • Using off-the-shelf ETL tool
  • 24. ETL Tool Key Features (1) • Extract, Load => flexible on interfaces › Flat files, DBMS, XML data, XLS, › MQ, web services, LDAP › Semi-structured data (emails, web logs, wiki pages) › Unstructured data (blogs, documents) › Extensibility with custom connectors › Local data, remote data FTP(S), SFTP, SCP, http(s) • Clean › Lookups, Validations, Filters, Translations • Transform › Changing data structure, Joins, (De)Normalization, Aggregation, RollUp, Sorting, Partitioning, Data De-duplication › Ability to call external tools
  • 25. ETL Tool Key Features (2) • Performance › Symmetric Multiprocessing (SMP) • Pipeline processing • Multithreaded processing › Massively Parallel Processing (MPP) • Clustering • MapReduce › Load balancing • User friendliness › GUI › Metadata capture › Training time • Development › Reusable components › Impact Analysis / Data Lineage › Documentation
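As a rough illustration of the partitioning and multithreaded processing mentioned under Performance, the sketch below splits records by key, aggregates each partition on a worker thread, and merges the partial results. The record layout and function names are assumptions for the example. Threads in CPython do not speed up CPU-bound work, so this shows the structure of the technique rather than a performance gain.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def partition(records, key, workers):
    """Split records into `workers` buckets; equal keys always share a bucket."""
    buckets = [[] for _ in range(workers)]
    for record in records:
        # crc32 is used because it is stable across runs, unlike hash() on str.
        index = zlib.crc32(str(record[key]).encode("utf-8")) % workers
        buckets[index].append(record)
    return buckets


def aggregate(bucket):
    """Sum the amount per customer within one bucket."""
    totals = {}
    for record in bucket:
        totals[record["customer"]] = totals.get(record["customer"], 0) + record["amount"]
    return totals


def parallel_totals(records, workers=4):
    """Aggregate every partition on its own worker thread and merge the results."""
    merged = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(aggregate, partition(records, "customer", workers)):
            merged.update(partial)  # safe: a customer lives in exactly one bucket
    return merged


# Hypothetical input: 100 records spread over 5 customers.
records = [{"customer": f"c{i % 5}", "amount": i} for i in range(100)]
totals = parallel_totals(records)
```

Partitioning by key is what makes the merge trivial: because no key appears in two partitions, partial results never need to be combined.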
  • 26. ETL Tool Key Features (3) • Manageability › Team collaboration › Transformation repository › Metadata repository › Development process (Dev -> Test -> Prod) › Security • Runtime › Scheduler Automation › Recovery and Restart › Workflow • Others › Vendor stability › Release cycle › Support
  • 27. ETL Market Source: Ted Friedman, Mark A. Beyer, Eric Thoo: Magic Quadrant for Data Integration Tools; Gartner RAS Core Research Note G00207435; 19 November 2010
  • 28. Well Known ETL Tools • Commercial › Ab Initio › IBM DataStage › Informatica PowerCenter › Microsoft SQL Server Integration Services › Oracle Data Integrator › SAP Business Objects – Data Integrator › SAS Data Integration Studio • Open-source based › Adeptia Integration Suite › Apatar › CloverETL › Pentaho Data Integration (Kettle) › Talend Open Studio/Integration Suite • The list above is not meant to be comprehensive
  • 29. ETL or ELT? • ETL = Extract – Transform – Load › Much more flexible • ELT = Extract – Load – Transform › Pushed forward (mainly) by Oracle • First get the data into database • Then use Oracle DB tools to work with it › Less flexible • Tightly coupled to vendor’s database/solution • Less flexible on output formats • Requires staging area › Possibly better performance • All data are processed in the same database • Nothing is downloaded from database for Transform step
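The difference between the two orderings can be sketched with an in-memory SQLite database standing in for the target system. The table names (`sales_etl`, `staging`, `sales_elt`) and sample rows are assumptions for the example; SQLite is used only because it ships with Python, not because the slide refers to it.

```python
import sqlite3

# Hypothetical raw source rows: (country code, amount).
rows = [("cz", 10.5), ("cz", 4.5), ("us", 7.0)]


def etl(conn, rows):
    """ETL: transform outside the database, then load only the finished result."""
    totals = {}
    for country, amount in rows:
        key = country.upper()
        totals[key] = totals.get(key, 0.0) + amount
    conn.execute("CREATE TABLE sales_etl (country TEXT, total REAL)")
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", sorted(totals.items()))


def elt(conn, rows):
    """ELT: load raw rows into a staging table, then transform with SQL in place."""
    conn.execute("CREATE TABLE staging (country TEXT, amount REAL)")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT UPPER(country) AS country, SUM(amount) AS total "
        "FROM staging GROUP BY UPPER(country)"
    )


conn = sqlite3.connect(":memory:")
etl(conn, rows)
elt(conn, rows)
etl_result = conn.execute("SELECT country, total FROM sales_etl ORDER BY country").fetchall()
elt_result = conn.execute("SELECT country, total FROM sales_elt ORDER BY country").fetchall()
```

Both paths produce the same target rows. The ELT variant needs the staging table and expresses the logic in the database's SQL dialect, which is the coupling the slide points out.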
  • 30. Session 2: Dominate Your Data with CloverETL™
  • 31. What is CloverETL • A data integration software platform › Manages, designs and runs your data › Embeddable and scalable › Integrates easily with databases, operating systems and applications • Clover works on all OS › Linux › Windows › HP-UX › AIX › IBM AS/400 › Solaris › Mac OS X • ETL platform that dominates your data: Extract – Transform – Load › Reads from one or more data sources › Transforms data in almost any way imaginable › Writes to any number of data targets • Legacy of open source and commitment to commercial use • CloverETL Engine can be used as embedded OEM as well • www.cloveretl.com
  • 32. CloverETL Key Features • Platform independent › Java, integration, library support • Cost › Delivers lowest total cost of ownership • Scalability › Desktop → Enterprise → Cluster • OEM Embeddable › Small footprint, extensible and customizable • Usability › Built by ETL experts for ETL experts • Services › Clover was built to require minimal services • Performance › Outstanding at all production levels
  • 33. CloverETL Product Suite (diagram labels: Design, Manage) • CloverETL Server › Production Platform › Web app (Tomcat, WebSphere, GlassFish) › For Enterprise Integration • CloverETL Designer › Visual Designer › Transformation Developer › Eclipse Platform • CloverETL Engine › Pure Java 6.0 › Embeddable › Extensible Runtime › Ideal for OEM
  • 34. CloverETL Designer Features • Transformation design • Intuitive GUI • Drag & drop • Components Library • Debug
  • 35. CloverETL Vision • Vision › CloverETL enables companies to get more value from their data quickly without massive infrastructure expense or years of project investment. › Dominate your data • Approach › CloverETL is the best value for money. It builds off the open source foundation of the CloverETL Engine and scales from desktop to enterprise to cluster. › Get it done quickly • Investment › CloverETL is an easy to use ETL software that can grow by both core features and user expertise… at a fraction of the cost of larger system vendors. › Low cost buy-in for our customers
  • 36. CloverETL Approach (diagram: Input → Transform → Output) • Components › Prebuilt algorithms • Graph › Processing algorithm in visual form • Data Flow › Edges between components • Process › Build > Connect > Configure › Data processed as structured records with named fields › Components operate on record fields
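The idea of components connected by edges, with records flowing between them, can be approximated in plain Python with generators. This is a conceptual sketch only and does not use or imitate the CloverETL API; the component names and sample records are invented for the example.

```python
def reader(records):
    """Input component: emits records (dicts with named fields) one at a time."""
    for record in records:
        yield dict(record)


def filter_records(edge, predicate):
    """Component that passes on only the records matching the predicate."""
    for record in edge:
        if predicate(record):
            yield record


def reformat(edge, mapping):
    """Transform component: maps each input record to an output record."""
    for record in edge:
        yield mapping(record)


def writer(edge):
    """Output component: collects everything arriving on its input edge."""
    return list(edge)


source = [
    {"name": "Alice", "age": 34},
    {"name": "Bob", "age": 17},
    {"name": "Carol", "age": 52},
]

# Build > Connect > Configure: each call wires one component to the previous edge.
adults = filter_records(reader(source), lambda r: r["age"] >= 18)
labelled = reformat(
    adults,
    lambda r: {"name": r["name"].upper(), "age_group": "50+" if r["age"] >= 50 else "18-49"},
)
output = writer(labelled)
```

Because generators are lazy, a record travels through the whole chain before the next one is read, which resembles pipeline processing along the edges of a graph.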
  • 37. Transformation capabilities • 50+ specialized components available for use › Readers and writers • Text or binary files, CSV, XML • Archives including ZIP or GZIP • Remote transfers over HTTP/FTP(S) • Access to messaging via JMS › Database connectivity • JDBC connectivity, support for bulk loaders • Supports any SQL statements, stored procedure calls › Transformers and aggregators • Variety of data manipulation algorithms: sorting, deduping, joining, arithmetics, aggregating, statistic functions and more • Customizable with user-defined code • Alternative implementations for efficient execution
  • 38. Physical architecture • Others: Server-centric with thin clients › Server is necessary for development and execution › Transformations and data are stored remotely on server › Limited use when working on multiple sites or in restricted-access networks • CloverETL: Designer is a standalone application › Development and execution is possible without central server › Transformations and metadata are stored on local machine › Transformations can be also deployed to server for centralized management › Designer available for all major platforms (Linux, Windows, Mac)
  • 39. Repository • Others: Central repository › With proprietary storage format (binary files, tables) › Often managed with a proprietary version control system › Can be problematic with mass changes or when “hacking” is necessary • CloverETL: XML format files and directories › No proprietary repository, uses plain files and directories › Transformation and metadata stored in human-readable XML › Open to any version management systems • CVS, SVN, IBM Rational ClearCase, git › Integrated with Eclipse VMS clients • Subclipse, EGit, ClearCase
  • 40. Flexibility and expressive power • Others: Expression-based languages only › Expressions and built-in functions for simple data manipulation › No support for programming statements (loops, user functions) › Limited when data manipulation requires complex coding • CloverETL: Scripting and more › Components are customizable with CTL or Java code › CTL: scripting language with simple syntax • Allows simple expressions as well as complex code • Has variables, loops, user-defined functions • Many built-in data validation functions › Java: mature programming language that allows full coding • Access to variety of existing libraries • GC-based memory management = rapid development
  • 41. Extensibility • Others: Custom components › Limited support for developing custom components › Impossible to extend other aspects of data processing • CloverETL: Plugin-based extensible platform › Ready to extend, customize and modify › Plugin-based architecture for easy extension management › Supports custom components, connections, functions › Implemented in Java • Access to libraries • Memory management • Developers
  • 42. Session 2: CloverETL™ Designer Examples
  • 43. CloverETL Server Features • Automation and scheduling • File and event triggers • Workflows • Monitoring and logging • User management • Real-time ETL • Clustering • Load balancing • Failover • Distributed processing • Inexpensive buy-in
  • 44. Key features • Runtime automation › Allows integration with existing infrastructure › Simplified management and execution of data transformations • Scalability and optimization › Increases transformation performance › Shortens response time in continuous/transactional processing • Security › Controls access to data, transformations and server configuration › Secures communication between server and clients • Clustering › Allows cooperation between multiple processing nodes › Improves scalability, performance and error resiliency
  • 45. Runtime automation • Internal scheduler › Supports one-time or periodic execution › Allows interval-based scheduling, including flexible cron-like rules • Integration with enterprise schedulers › Transformation execution can be started by external scheduler › cron, IBM Tivoli (Maestro), Autosys, UC4 › Scheduler instructs CloverETL Server via one of its interfaces (HTTP,JMX) • Events, tasks and dependencies › Tasks and transformations are started on internal or external events › Internal: transformation finished or failed › External: file arrived or its size changed › Allows creating dependencies between executed tasks › Suitable for logical sequencing of execution or monitoring purposes
  • 46. Runtime automation (cont.) • Monitoring › Supports alert emails or messages with configurable content › Automatically populated with execution status, log, statistics › Allows integration with ticketing systems, support team • Execution history › Automatically stores performance and statistics about each execution › Stored in database tables and log files › Open for any trend analysis and reporting • Archiving › Configurable cleanup of execution logs and history › Can be further extended by scripting
  • 47. Scalability and optimization • Parallel execution › Executes multiple transformations in parallel › Can execute multiple instances of a single transformation (SOA) • Graph pooling › Improves response time › Useful with SOA architectures and Launch Services • Launch services › Applications implemented as data transformations and deployed as RESTful web services › Schema on next slide
  • 48. CloverETL Cluster • Increased processing and throughput › Parallel execution of transformations over multiple servers or nodes › Load balancing based on individual node utilization • Increased fault tolerance › Fail over in case of problems with particular nodes › Data replication • Increased flexibility › Cluster can be dynamically reconfigured by adding or removing nodes • Large quantity of data loaded → Processed in parallel in Cluster → Written out in parallel
  • 49. Session 3: Data Integration Projects
  • 50. Data Integration Projects • This section covers › How to manage DI projects › Phases of DI project › Responsibilities › Typical issues
  • 51. Data Integration Projects • Phases of typical Data Integration Project › Requirements › Planning › Analysis (of ETL steps) › Implementation › Documentation › RTP › Support
  • 52. Requirements • Functional › Input data › Output data › Output data format › Transformation logic • Non-functional › Time restrictions › Availability › Frequency of update › Data Latency › How to handle erroneous records › Security requirements
  • 53. Planning • ETL system implementation must be planned properly › Time for implementation › Correctly prioritized › Thorough data analysis extremely important › Unforeseen data quality issues cause delays › Biggest risk is unexpected data quality issues › Communicate properly • Keep it simple › Do not try to save the world › “If you think it can be done simply, do it simply”
  • 54. Analysis – Extract • In which source systems is the data we need? • How can we access the system? › Flat files / database access / XML / web service / JMS › Full extract / incremental / change notification / on-request › Local access / ftp(s) / sftp / scp / MQ • Data Syntax › What is a record? › What is record delimiter? › What is field delimiter? › What is the data length / data type / format? • Data Semantics › What are the field names? › What data does the field represent? › Ok, what data does the field really represent? › Are there any duplicates? • What are the limitations / restrictions? › What are the data volumes? › How often can the export be done? › Any impact on network (NIA)? • What is the expected data growth rate?
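The answers to the Data Syntax questions (record delimiter, field delimiter, data type and format) can be written down as explicit metadata and used to drive the parser. The sketch below assumes a hypothetical pipe-delimited order file; the delimiters, field names and formats are examples, not values from the presentation.

```python
from datetime import datetime
from decimal import Decimal

# Assumed answers to the "Data Syntax" questions for a hypothetical flat file.
RECORD_DELIMITER = "\n"
FIELD_DELIMITER = "|"
FIELDS = [
    ("order_id", int),
    ("order_date", lambda s: datetime.strptime(s, "%Y-%m-%d").date()),
    ("amount", Decimal),
]


def parse(raw):
    """Split raw text into typed records according to the declared metadata."""
    records = []
    for number, line in enumerate(raw.split(RECORD_DELIMITER), start=1):
        if not line:
            continue  # skip the empty piece after a trailing delimiter
        values = line.split(FIELD_DELIMITER)
        if len(values) != len(FIELDS):
            raise ValueError(f"record {number}: expected {len(FIELDS)} fields, got {len(values)}")
        records.append({name: convert(value) for (name, convert), value in zip(FIELDS, values)})
    return records


records = parse("1|2011-03-01|19.90\n2|2011-03-02|5.00\n")
```

Declaring the metadata separately from the parsing loop means a change in the source format is a change to `FIELDS` or the delimiters, not to the code.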
  • 55. Analysis – Clean & Transform • Which fields need to be validated? › … against which source? › How to handle erroneous data? • What is the data flow? › Between source and target data › What is the transformation logic? › Action on error? • Stop transformation • Process valid data only • Transformation restartability › Small data volumes – transaction based › Huge data volumes – process in bulk mode
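The two "action on error" options can be sketched as a switch on the processing function: either stop at the first invalid record, or process valid data only and set the rejects aside. The validation rules and field names (`id`, `amount`) are assumptions chosen for the example.

```python
def validate(record):
    """Return a list of validation errors for one record (empty if valid)."""
    errors = []
    if not record.get("id"):
        errors.append("missing id")
    try:
        float(record.get("amount", ""))
    except (TypeError, ValueError):
        errors.append("amount is not a number")
    return errors


def process(records, on_error="skip"):
    """Apply the chosen error action: 'stop' aborts, 'skip' keeps valid data only."""
    valid, rejected = [], []
    for record in records:
        errors = validate(record)
        if not errors:
            valid.append(record)
        elif on_error == "stop":
            raise ValueError(f"invalid record {record!r}: {errors}")
        else:
            rejected.append((record, errors))  # kept for a reject file or report
    return valid, rejected


# Hypothetical batch with two erroneous records.
batch = [
    {"id": "1", "amount": "10.0"},
    {"id": "", "amount": "5"},
    {"id": "3", "amount": "abc"},
]
valid, rejected = process(batch, on_error="skip")
```

Returning the rejected records together with their reasons, rather than dropping them silently, makes it possible to report data quality issues back to the source system owners.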
  • 56. Analysis – Load • What is the target schema? • What is the target data volume? • What are the history requirements? › Usually SCD type 1, 2 or 3 • Data Syntax › What is a record? › What is record delimiter? › What is field delimiter? › What is the data length / data type / format? • Data Semantics › What are the field names? › What data does the field represent? › Ok, what data does the field really represent? • What are the limitations / restrictions? › What are the data volumes? › How often can the export be done? › Any impact on network (NIA)?
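The history requirement decides how a changed dimension record is loaded. The sketch below contrasts slowly changing dimension (SCD) type 1, which overwrites the old value, with type 2, which closes the current row and appends a new version. The in-memory structures and the `valid_from` / `valid_to` column names are assumptions for the example; type 3 (keeping the previous value in an extra column) is not shown.

```python
from datetime import date


def apply_scd1(dimension, key, attributes):
    """SCD type 1: overwrite the attributes in place, so no history is kept."""
    dimension[key] = dict(attributes)


def apply_scd2(dimension, key, attributes, today):
    """SCD type 2: close the current row and append a new version of it."""
    current = next(
        (row for row in dimension if row["key"] == key and row["valid_to"] is None),
        None,
    )
    if current is not None:
        if current["attributes"] == attributes:
            return  # nothing changed, keep the current version open
        current["valid_to"] = today
    dimension.append(
        {"key": key, "attributes": dict(attributes), "valid_from": today, "valid_to": None}
    )


# Hypothetical customer 42 moves from Prague to Brno.
customers_t1 = {}
apply_scd1(customers_t1, 42, {"city": "Prague"})
apply_scd1(customers_t1, 42, {"city": "Brno"})

customers_t2 = []
apply_scd2(customers_t2, 42, {"city": "Prague"}, date(2011, 1, 1))
apply_scd2(customers_t2, 42, {"city": "Brno"}, date(2011, 3, 1))
```

After both loads the type 1 dimension only knows the current city, while the type 2 dimension holds two rows and can still answer where the customer lived in February.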
  • 57. Implementation • Development › Enforce standardization • Naming conventions • Best practices • Generating surrogate keys • Looking up keys • Applying default values • Testing › Review › Testing on Production data › Unit Testing
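Three of the standardized tasks listed above (generating surrogate keys, looking up keys, applying default values) are small enough to sketch together. The class name, default values and sample rows below are hypothetical and serve only to illustrate the pattern.

```python
import itertools


class SurrogateKeyLookup:
    """Hands out sequential surrogate keys and remembers them per natural key."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._keys = {}

    def key_for(self, natural_key):
        # Look up an existing key first; generate a new one only when needed.
        if natural_key not in self._keys:
            self._keys[natural_key] = next(self._counter)
        return self._keys[natural_key]


# Assumed defaults for fields that arrive empty or missing.
DEFAULTS = {"country": "UNKNOWN", "segment": "N/A"}


def apply_defaults(record):
    """Return a copy of the record with empty fields replaced by defaults."""
    filled = dict(record)
    for field, default in DEFAULTS.items():
        if filled.get(field) in (None, ""):
            filled[field] = default
    return filled


customer_keys = SurrogateKeyLookup()
facts = []
for row in [
    {"customer": "ACME", "country": "CZ", "segment": ""},
    {"customer": "Globex", "country": None, "segment": "retail"},
    {"customer": "ACME", "country": "CZ", "segment": "b2b"},
]:
    row = apply_defaults(row)
    row["customer_key"] = customer_keys.key_for(row.pop("customer"))
    facts.append(row)
```

Putting these helpers in one shared place, instead of reimplementing them in every transformation, is one way to enforce the standardization the slide asks for.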
  • 58. Implementation • Documentation › Data sources / targets / transformations › Data Lineage • Important to know and publish › Frequency of ETL process runs › Error handling › Support – Monitoring checklist • RTP
  • 59. Planning & Leadership • Contact person for each source system • Contact business person • ETL team responsibilities › Define ETL scope › Perform source system data analysis › Define data quality strategy › Gather & document business rules from business users › ETL Implementation › Defining & executing Unit & QA testing › Implementing production
  • 60. ETL team Roles & Responsibilities • ETL Manager • ETL Architect • ETL Developer • System Analyst • Data-Quality Specialist • Database Administrator (DBA) • Dimension Manager • Fact Table Provider
  • 61. Typical Issues • Project Management • Poor Data Analysis • Data Understanding • Performance • Scalability
  • 62. Typical Issues – Project Management • ETL takes 70-80% of project resources › Be aware of this from the beginning › Communicate this to stakeholders › Plan ETL phase properly • Data Quality › Unexpected data quality issues are the biggest risk › DQ issues cause delays › DQ issues generate extra work
  • 63. Typical Issues – Data Understanding • Source system (data) › Not documented › Documented incorrectly › Represents something other than it should › Data not clean • Transformation/Requirements › Not specified properly › Not specified at all › Initial analysis has not revealed issues/complexity › Requirements being changed
  • 64. Typical Issues – Performance & Scalability • Performance › Is the performance ok now? › Will performance be ok in 5 years? • Scalability › What is the data growth rate? › Are we testing on production data volumes? • Change Data Capture › Time consuming task › Issue on old systems
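One reason Change Data Capture is time consuming on old systems is that, when the source offers no change log or timestamps, the only option is to compare full snapshots. A minimal sketch of that comparison, assuming each snapshot is a dictionary keyed by primary key (the sample data is invented):

```python
def capture_changes(previous, current):
    """Compare two full snapshots keyed by primary key and classify the changes."""
    inserted = {k: v for k, v in current.items() if k not in previous}
    deleted = {k: v for k, v in previous.items() if k not in current}
    updated = {k: v for k, v in current.items() if k in previous and previous[k] != v}
    return inserted, updated, deleted


# Hypothetical snapshots of a customer table taken on two consecutive days.
yesterday = {1: {"city": "Prague"}, 2: {"city": "Brno"}, 3: {"city": "Vienna"}}
today = {1: {"city": "Prague"}, 2: {"city": "Ostrava"}, 4: {"city": "Berlin"}}

inserted, updated, deleted = capture_changes(yesterday, today)
```

The cost grows with the size of the whole table rather than with the number of changes, which is why this approach becomes a performance concern as data volumes grow.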
  • 65. Session 4: Current Trends
  • 66. Market Trends • Shift to Semi-structured & Unstructured data › Emails, documents, blogs, … • Real-time processing › CRM, Zero-latency business › SOA, Web Services, ESB, JMS, MQ • Cloud-mania › Cloud, Cluster, Elastic Cluster › MapReduce, Apache Hadoop • Reducing TCO › License costs, Development cost, Maintenance › Emphasis on value; not price • Services for small customers › Require better ROI
  • 67. Literature Ralph Kimball, Joe Caserta: The Data Warehouse ETL Toolkit; Wiley Ralph Kimball, Margy Ross: The Data Warehouse Toolkit; Wiley Len Silverston: The Data Model Resource Book; Wiley
  • 68. Contact Javlin • Web: www.cloveretl.com • US › Javlin Inc. › 8000 Towers Crescent Drive, Suite 1350, Vienna, VA 22182, USA › Web: www.javlininc.com › Email: info@javlininc.com › Phone: +1 703 847 3600 • Europe › Javlin a.s. › Křemencova 18, 110 00 Praha 1, Czech Republic › Web: www.javlin.eu › E-mail: info@javlin.eu › Phone: +420 277 003 200