Bill Hayduk
Founder, CEO
QuerySurge™, a software division of RTTS
The Data World Distilled:
Understanding how the data world works in the Big Data era
About
RTTS is the parent company of QuerySurge and began as a consulting firm centered on QA & testing.
FACTS
• RTTS founded: 1996
• Location: New York, NY (headquarters)
• Customer profile: Fortune 1000
• Software offering: QuerySurge (2012)
QuerySurge Partners:
• 11 industry-leading Technology Partners
• 14 global System Integrators
• 22 regional consulting firms (Sales & Consulting Partners)
DWH, BI, Big Data Marketplaces
Data Warehouse Marketplace
“The worldwide data warehouse management software market is forecast to generate nearly $17 billion in revenue by 2019” - Forrester
Top vendors: Oracle, Teradata, IBM, Microsoft, SAP, Micro Focus and Amazon
Business Intelligence Marketplace
“The business intelligence (BI) and analytics software market is forecast to grow to $22.8 billion by the end of 2020” - Gartner
Top vendors: SAP, IBM, SAS, Microsoft, Oracle, Tableau, Qlik, MicroStrategy, Information Builders
Big Data Marketplace
“By the end of 2020, companies will spend > USD $72 billion on Big Data hardware, software, & professional services” - IDC
Top vendors: Oracle, IBM, Microsoft, Amazon, Micro Focus, Hortonworks, Cloudera, Teradata, SAP, MongoDB, MapR, DataStax, Snowflake
Fast Facts about Data
• By the end of 2020, companies will spend > USD $72 billion
on Big Data hardware, software, & professional services
(the current market size is USD $46 billion)
• > 75% of companies are investing or planning to invest in
Big Data in the next 2 years
• Professional services represent 43% of the Big Data market
(services = USD $31 billion of $72 billion)
The Data World Distilled
• Data Warehouse
• Data Quality
• Data Testing
• Big Data
• ETL / Data Integration
• BI & Analytics
• Data Governance
What is Big Data?
Big Data: defined as too much volume, velocity and variability to work on normal
database architectures.
“The market for big data is $70 billion and growing by
15% a year.”
- EMC COO Pat Gelsinger
Size
Defined as 5 petabytes or more
1 petabyte = 1,000 terabytes
1,000 terabytes = 1,000,000 gigabytes
1,000,000 gigabytes = 1,000,000,000 megabytes
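The decimal unit conversions above can be checked with a few lines of Python. This is only a minimal sketch: the constant and function names are illustrative, and the 5-petabyte threshold is simply the figure quoted on the slide.

```python
# Decimal (SI) storage units, as used on the slide: 1 PB = 1,000 TB, and so on.
MB_PER_GB = 1_000
GB_PER_TB = 1_000
TB_PER_PB = 1_000


def petabytes_to_megabytes(petabytes: int) -> int:
    """Convert petabytes to megabytes using decimal units."""
    return petabytes * TB_PER_PB * GB_PER_TB * MB_PER_GB


# The slide's "Big Data" size threshold of 5 petabytes, expressed in smaller units.
BIG_DATA_THRESHOLD_PB = 5
threshold_tb = BIG_DATA_THRESHOLD_PB * TB_PER_PB
threshold_gb = threshold_tb * GB_PER_TB

print(petabytes_to_megabytes(1))   # 1000000000
print(threshold_tb, threshold_gb)  # 5000 5000000
```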
the Big Data Impact
Wal-Mart handles more than 1 million customer transactions every hour.
• the data is imported into databases that contain > 2.5 petabytes of data
• the equivalent of 167 times the information contained in all the books in the US Library of Congress
Others:
• Facebook handles 40 billion photos from its user base
• Google processes 1 Terabyte per hour
• Twitter processes 85 million tweets per day
• eBay processes 80 Terabytes per day
What is Hadoop?
Hadoop is an open source project that develops software for scalable, distributed computing.
• easily deals with complexities of high volume, velocity and variability of data
• is a framework for distributed processing of large data sets across clusters of computers using simple programming models
• scales up from single servers to 1,000’s of machines, each offering local computation and storage
• detects and handles failures at the application layer
Key Attributes of Hadoop
• Redundant and reliable
• Extremely powerful
• Easy to program distributed apps
• Runs on commodity hardware
Top Vendors
“By the end of 2020, companies will spend more than USD $72 billion on Big Data hardware, software, & professional services” - IDC
Basic Hadoop Architecture
On each machine:
• MapReduce – the processing part that manages the programming jobs (a.k.a. Task Tracker)
• HDFS (Hadoop Distributed File System) – stores data on the machines (a.k.a. Data Node)
Basic Hadoop Architecture (continued)
Cluster
• Add more machines for scaling – from 1 to 100 to 1,000
• Job Tracker accepts jobs, assigns tasks, identifies failed machines
• Name Node coordinates HDFS. Inserts and extractions are communicated through the Name Node.
[Diagram: a cluster of machines, each running a Task Tracker and a Data Node, together with the Name Node]
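To make the MapReduce idea concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, using word counting as the job. It does not use the Hadoop API; the function names are illustrative assumptions, and in a real cluster the map tasks would run on the machines that hold the data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: sum the grouped values for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    # Everything runs in one process here; a cluster would spread the
    # map and reduce tasks across many machines.
    emitted = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle(emitted))


counts = word_count(["big data big cluster", "data node"])
print(counts)  # {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```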
Apache Hive
Apache Hive - a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Hive provides a mechanism to query the data using a SQL-like language called HiveQL that interacts with the HDFS files:
• create
• insert
• update
• delete
• select
[Diagram: HiveQL sitting on top of MapReduce (Task Tracker) and HDFS (Data Node) on each machine]
What is NoSQL?
A term used to describe high-performance, non-relational databases that provide a mechanism for storage and
retrieval of data that is modeled in means other than the tabular relations used in relational databases
NoSQL Database Types
Document databases pair each key with a complex data structure known as a document. Documents can contain
many different key-value pairs, or key-array pairs, or even nested documents.
Graph stores are used to store information about networks of data, such as social connections. Graph stores
include Neo4J and Giraph.
Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute
name (or 'key'), together with its value. Examples of key-value stores are Riak and Berkeley DB. Some key-value
stores, such as Redis, allow each value to have a type, such as 'integer', which adds functionality.
Wide-column stores such as Cassandra and HBase are optimized for queries over large datasets, and store
columns of data together, instead of rows.
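The four NoSQL data models can be pictured with plain Python dictionaries. This is a rough sketch of the data shapes only, not of any vendor's API; all names and sample values are made up for illustration.

```python
# Key-value store: every item is a key together with its value.
key_value_store = {"session:42": "alice"}

# Document store: each key is paired with a nested document.
document_store = {
    "user:1": {
        "name": "Alice",
        "emails": ["alice@example.com"],
        "address": {"city": "New York"},
    }
}

# Wide-column store: the values of one column are kept together.
wide_column_store = {
    "name": {"row1": "Alice", "row2": "Bob"},
    "city": {"row1": "New York", "row2": "Boston"},
}

# Graph store: nodes plus the connections between them.
graph_store = {"alice": ["bob"], "bob": ["alice", "carol"], "carol": []}


def column_values(store: dict, column: str) -> list:
    """Read one whole column without touching the other columns."""
    return list(store[column].values())


def friends_of_friends(graph: dict, person: str) -> set:
    """Follow two hops of connections, excluding the starting person."""
    result = set()
    for friend in graph[person]:
        result.update(graph[friend])
    result.discard(person)
    return result


print(column_values(wide_column_store, "city"))  # ['New York', 'Boston']
print(friends_of_friends(graph_store, "alice"))  # {'carol'}
```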
NoSQL versus Hadoop
When to use NoSQL?
• Online real-time processing
• Data set is smaller
• Measured in milliseconds
When to use Hadoop?
• Offline big data processing
• Offline analytics
• Measured in minutes & hours
Source: classpattern.com
NoSQL Example: Use Cases
• Data Warehouse Batch Aggregation
• ETL from MongoDB
• ETL to MongoDB
Source: MongoDB, Inc.
What is a Data Warehouse?
A Data Warehouse:
• is typically a relational database that is designed for query and analysis rather than for transaction processing
• is a place where historical data is stored for archival, analysis and security purposes
• contains either raw data or formatted data
• combines data from multiple sources:
  - sales
  - salaries
  - operational data
  - human resource data
  - inventory data
  - web logs
  - social networks
  - Internet text and docs
  - other
[Diagram: source systems such as Legacy DB, CRM/ERP DB and Finance DB]
Data Warehouse - the marketplace
“The worldwide data warehouse management software market is forecast to generate nearly $17 billion in revenue by 2019” - Forrester
Data Warehouse size (per analyst firm Gartner):
• Small data warehouses: < 5 TB
• Midsize data warehouses: 5 TB - 20 TB
• Large data warehouses: > 20 TB
Leaders in on-premises Data Warehouse Data Management Systems - Analyst firm Gartner’s ‘Magic Quadrant for Data Warehouse Database Management Systems’
Data Warehouse - the marketplace: Alternate Delivery Models
Leading Cloud DWHs
Leading Appliance DWHs
An appliance is software and servers optimized together.
[Photo: Oracle founder Larry Ellison with an Exadata appliance]
Data Warehouse - Business Case
Why build a Data Warehouse?
• Data stored in operational systems (OLTP) is not easily accessible
• OLTP systems are not designed for end-user analysis
• The data in OLTP is constantly changing
• OLTP systems may be deficient in historical data
• Diverse forms of data are stored in different platforms and/or dissimilar formats
The Data Warehouse Business Solution
• Collects data from different sources (other databases, files, web services, etc.)
• Integrates data into logical business areas
• Provides direct access to data with powerful reporting tools (BI)
Data Warehouse - about the data
The Data Warehouse data is:
• Subject-oriented
• Integrated
• Non-volatile
• Time-variant
Data Integration & the ETL process
ETL = Extract, Transform, Load
Why ETL?
The data warehouse needs to be loaded regularly (daily/weekly) so that it can serve its purpose of facilitating business analysis.
the ETL process
Extract – take data from one or more OLTP systems and copy it into the warehouse.
Transform – remove inconsistencies, assemble to a common format, add missing fields, summarize detailed data and derive new fields to store calculated data.
Load – map the data, transform and/or load it into the DWH.
The ETL function is performed either by home-grown software or by commercial software.
[Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) → ETL Process (Extract, Transform, Load) → Target DWH]
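The three ETL steps can be sketched with Python's built-in sqlite3 module standing in for both the OLTP source and the target warehouse. This is a home-grown illustration under stated assumptions: the table names (orders, sales_summary), column names and sample rows are hypothetical and do not come from any real system.

```python
import sqlite3


def extract(source: sqlite3.Connection) -> list:
    """Extract: read raw order rows from the (hypothetical) OLTP source."""
    return source.execute("SELECT customer, amount_cents FROM orders").fetchall()


def transform(rows: list) -> list:
    """Transform: standardize names and summarize detail rows per customer."""
    totals = {}
    for customer, amount_cents in rows:
        name = customer.strip().upper()  # remove inconsistencies in the name
        totals[name] = totals.get(name, 0) + amount_cents
    # Derive a new calculated field (dollars) from the summarized cents.
    return [(name, cents, cents / 100) for name, cents in sorted(totals.items())]


def load(target: sqlite3.Connection, rows: list) -> None:
    """Load: write the transformed rows into the target warehouse table."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS sales_summary "
        "(customer TEXT PRIMARY KEY, total_cents INTEGER, total_dollars REAL)"
    )
    target.executemany("INSERT OR REPLACE INTO sales_summary VALUES (?, ?, ?)", rows)
    target.commit()


# Build a tiny in-memory source system with inconsistent customer names.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (customer TEXT, amount_cents INTEGER)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(" acme ", 1050), ("ACME", 250), ("Globex", 999)],
)

target = sqlite3.connect(":memory:")
load(target, transform(extract(source)))
summary = target.execute("SELECT * FROM sales_summary ORDER BY customer").fetchall()
print(summary)
```

Keeping extract, transform and load as separate functions mirrors the diagram and makes each step testable on its own.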
Leaders in ETL Solutions
Continuous Integration/ETL solutions - the Marketplace
(ab initio)
Business Intelligence (BI)
Business Intelligence – What is it?
• Software applications used in spotting, digging-out, and
analyzing business data
• BI provides simple access to data which can be used in day
to day operations, integrates data into logical business areas
• BI provides historical, current and predictive views of
business operations
• BI is made up of several related activities, including data
mining, online analytical processing, querying and reporting.
BI & Analytics - the marketplace
Business Intelligence software is like reporting engines on steroids.
“The business intelligence (BI) and analytics software market is forecast to grow to $22.8 billion by the end of 2020”
“The four large "stack" vendors (SAP, Oracle, IBM and Microsoft) continue to consolidate the market, owning 59 percent of the market share.”
- Analyst firm Gartner
Leaders in BI - Analyst firm Forrester Research’s ‘Forrester Wave’
Business Intelligence (BI) - Who uses it?
• Wal-Mart uses vast amounts of data and category analysis to dominate the industry.
• Amazon and Yahoo follow a "test and learn" approach to business changes.
• Hardee’s, Wendy’s, and T.G.I. Friday’s use BI to make strategic decisions.
Business Intelligence (BI) & Data Marts
Data Mart
A database that has the same characteristics as a data warehouse, but is usually smaller and is focused on the data for one division or one workgroup within an enterprise.
Data marts typically hold aggregated data and some granular data. A data mart is a subset of the DWH and makes Business Intelligence reporting more efficient. BI tools sit on top of the data marts.
Business Intelligence (BI) & Analytics
[Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) → ETL Process → Target DWH → ETL Process → Data Mart]
Data Quality Issues
• Data Quality best practices boost revenue by 66%.
• 46% of companies cite Data Quality as a barrier to adopting Business Intelligence products.
• 80% of organizations… will underestimate the costs related to the data acquisition tasks by an average of 50 percent.
• The average organization loses $14.2 million annually through poor Data Quality.
Data Quality
Primary Characteristics of Data Quality tools (courtesy of Gartner’s “Magic Quadrant for Data Quality Tools”):
o Profiling
o Parsing and standardization
o Generalized Cleansing
o Matching
o Monitoring
o Enrichment
o Subject-area-specific support
o Metadata management
o Configuration environment
Data Quality - the marketplace
“The market for data quality software tools reached $1.61 billion in 2017 (the most recent year for which Gartner has data), an increase of 11.6% over 2016. Gartner’s interactions with clients also indicate that demand remains high.”
- Analyst firm Gartner
Leaders in Data Quality - Analyst firm Gartner’s Magic Quadrant
Data Quality vs. Data Testing
Primary Characteristics of Data Quality tools (courtesy of Gartner’s “Magic Quadrant for Data Quality Tools”):
o Profiling
o Parsing and standardization
o Generalized Cleansing
o Matching
o Monitoring
o Enrichment
o Subject-area-specific support
o Metadata management
o Configuration environment
Data Verification & Validation?
Primary Characteristics of Data Testing tools (courtesy of the book "Testing the Data Warehouse Practicum"):
▪ Data Completeness
▪ Data Transformation
▪ Regression Testing
▪ Reporting
Data Verification & Validation?
Where Data Testing fits in your data strategy
The Executive Office and Critical Data
CxOs are using Business Intelligence & Analytics to make critical business decisions – with the assumption that the underlying data is fine.
“The average organization loses $14.2 million annually through poor Data Quality.” - Gartner
[Diagram: Data Architecture, showing Mainframe, ETL and Business Intelligence & Analytics, with the typical data issue areas marked]
Key Roles in Building & Testing a Data Store
• Data Analyst: Creates data requirements (source-to-target map or mapping doc)
• Data Architect: Models and builds data store (Big Data lake, Data Warehouse, etc.)
• ETL Developer: Transforms and loads data from sources to target data stores
• Data Tester: Validates the data, based on mappings, as it moves and transforms from sources to targets
Data Requirements = Mapping Document
a.k.a. Source-to-Target Map
It’s the critical element required to efficiently plan the target Data Stores. It also defines the Extract, Transform, Load (ETL) process.
Intention:
✓ capture business rules
✓ data flow mapping and
✓ data movement requirements.
Mapping Doc specifies:
▪ Source input definition
▪ Target/output details
▪ Business & data transformation rules
▪ Absolute data quality requirements
▪ Optional data quality requirements.
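One way to picture a mapping document is as a list of field-level rules that can be applied to a source row. The sketch below is purely illustrative: the FieldMapping class, the field names (cust_nm, amt, ph) and the transformation rules are assumptions, not a format prescribed by the slides.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FieldMapping:
    source_field: str                 # source input definition
    target_field: str                 # target/output detail
    rule: Callable[[object], object]  # business & data transformation rule
    required: bool = True             # absolute (True) vs. optional (False) requirement


def apply_mappings(source_row: dict, mappings: List[FieldMapping]) -> dict:
    """Build one target row from one source row by applying every mapping."""
    target_row = {}
    for mapping in mappings:
        value = source_row.get(mapping.source_field)
        if value is None:
            if mapping.required:
                raise ValueError(f"missing required field: {mapping.source_field}")
            target_row[mapping.target_field] = None
            continue
        target_row[mapping.target_field] = mapping.rule(value)
    return target_row


mappings = [
    FieldMapping("cust_nm", "customer_name", lambda v: v.strip().title()),
    FieldMapping("amt", "amount_usd", lambda v: round(float(v), 2)),
    FieldMapping("ph", "phone", str, required=False),
]

row = apply_mappings({"cust_nm": " jane doe ", "amt": "19.999"}, mappings)
print(row)  # {'customer_name': 'Jane Doe', 'amount_usd': 20.0, 'phone': None}
```

Holding the rules as data means the ETL developer and the data tester can both work from the same definitions.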
Most Common Data Validation Method: Sampling
• Review Business Rules (i.e. mapping document, data flow mappings)
• Write Tests in SQL editor
• Execute 2 Tests: 1 at Source & 1 at Target
• Export results to 2 Excel files
• Compare a Sampling of results by eye (‘Stare & Compare’)
Issue with Stare & Compare:
• Impossible to visually compare billions of data sets
• Result: usually less than 1% of data is compared, which is a huge risk
Example - Current QuerySurge customer
• one test = 100 million rows X 200 columns = 20 billion data sets
• there is no practical way to manually verify (eyeball) this data set
• the client has more than 15,000 total tests
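The alternative to sampling is to compare every row returned by the source query with every row returned by the target query. The following is a minimal sketch of that idea, not QuerySurge's implementation; it assumes the first column of each result row is a unique key, and the sample rows are invented.

```python
from typing import Dict, List, Tuple

Row = Tuple  # one result row; the first column is assumed to be a unique key


def compare_result_sets(source_rows: List[Row], target_rows: List[Row]) -> Dict[str, list]:
    """Compare every source row with its target row instead of sampling."""
    source = {row[0]: row for row in source_rows}
    target = {row[0]: row for row in target_rows}
    return {
        "missing_in_target": sorted(k for k in source if k not in target),
        "unexpected_in_target": sorted(k for k in target if k not in source),
        "mismatched": sorted(k for k in source if k in target and source[k] != target[k]),
    }


# Scale of the example on the slide: 100 million rows x 200 columns.
rows, columns = 100_000_000, 200
data_sets = rows * columns
print(f"{data_sets:,}")  # 20,000,000,000

source_rows = [(1, "Alice", 10.0), (2, "Bob", 20.0), (3, "Carol", 30.0)]
target_rows = [(1, "Alice", 10.0), (2, "Bob", 25.0), (4, "Dave", 40.0)]

report = compare_result_sets(source_rows, target_rows)
passed = not any(report.values())
print(report, "pass" if passed else "fail")
```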
Data Store Roles, Tasks, & Timelines
Roles and their tasks along the timeline:
• Data Analyst: Determine Requirements → Create & maintain Mapping Document
• Data Architect: Review Mapping Document → Model and build target Data Stores → Maintain Target Data Stores
• ETL Developer: Review Mapping Document → Build data movement logic → Extract & load data or extract, transform, & load data
• Data Tester: Review Mapping Document → Create 2 SQL tests for each mapping with SQL editor → Execute tests → Dump results of tests to 2 Excel files → Compare Excel files by eye
(The diagram marks two iteration loops.)
About QuerySurge
What is QuerySurge?
QuerySurge™ is the leading testing solution for automated validation & testing of Big Data.
QuerySurge Use Cases
How QuerySurge Works
QuerySurge connects to any 2 points at one time.
• Queries (SQL, HQL) run against the Source Data and the Target Data
• Comparison of every data set
• Results – pass/fail: Data Intelligence Reports, Data Health Dashboard, automated email reports
Data shown in the diagram:
• Target Data: Big Data stores (Hadoop, NoSQL), Data Warehouses
• Source Data: Data Stores (Databases, Data Warehouses, Data Marts), Flat Files (Fixed Width, Delimited), Excel, JSON, XML, Web Services
• Business Intelligence Reports
Big Data Process - Developer & Tester
• ETL Developer: Codes data movement based on Mapping Requirements
• Data Tester: Tests data movement based on Mapping Requirements
• BI Analyst extracts data for reports
• Tester tests BI Reports (Testing Point #4)
[Diagram: data moving from Source Data through the Big Data lake, Data Warehouse and Data Mart to BI & Analytics, with ETL steps in between and Testing Points #1, #2 and #3 along the data movement]
QuerySurge Supports 50+ Data Stores
QuerySurge supports the following data stores…
• Amazon Redshift, Elastic MapReduce, DynamoDB
• Apache Hadoop/Hive, Spark
• Cassandra
• Cloudera
• Couchbase
• Exasol
• Flat Files (delimited, fixed-width)
• Hortonworks
• IBM (Db2, Netezza, Informix, Big Insights, Cloudant, MDM, Cognos)
• JSON files
• Mainframe
• MapR
• Micro Focus Vertica
• Microsoft (SQL Server DWH, HDInsight, PDW, SSAS, Excel, Access, SharePoint)
• MongoDB
• Oracle (Oracle DB, MySQL, Exadata, NoSQL, Hadoop)
• Pivotal Greenplum
• PostgreSQL
• Salesforce
• SAP (HANA, IQ, ASE, SQL Anywhere, Altiscale Data Cloud)
• Snowflake
• Tableau
• Teradata, Aster
• Workday
• XML
…and any other data store
4 main components of successful data governance
1) Data stewardship
Identifying and assigning roles and responsibilities:
- who is creating the data,
- who has overall responsibility for the data,
- who uses the data, who routes it,
- who oversees its use.
2) Data classification
Identifying and categorizing data types into groups.
3) Data quality
The process of measuring the reliability of current data sets to provide information that can be used to make organizational decisions.
4) Data management
The process where all the organization's data governance efforts come together. The company actively manages its data governance efforts, which involves creating the architectures and business processes required to properly maintain the organization’s data through its full lifecycle.
source: IBM Data Governance Council Maturity Model
Data Maturity Model - Process
• Patterned after the Capability Maturity Model Integration (CMMI) from the Software Engineering Institute (SEI) at Carnegie Mellon University
• Devised by IBM, along with 55 other companies
Characteristics shown across the maturity levels:
• Few stable processes exist
• “Just do it” mentality
• Data-related policies become more clear & reflect the organization’s data principles
• Data integration opportunities are better leveraged
• Risk assessment for data integrity & quality becomes part of the organization’s project methodology
• Further defined value of data for more data elements
• Data Governance methodology is introduced during the planning stages of new projects
• Enterprise data models are documented & published
• Data Governance is second nature
• ROI for data-related projects is tracked
• Business value of data mgmt is recognized
• Cost of data mgmt is easier to manage
• Costs are reduced as processes become automated
• More data-related controls are documented
• Metadata becomes an important part of documenting critical data elements
Data Governance - the marketplace
“Rapidly increasing growth in data volumes, rising regulatory & compliance mandates, and enhancing strategic risk management & decision-making are expected to drive the growth of the data governance market.”
“The data governance market size is expected to grow from $1.31 billion in 2018 to $3.53 billion by 2023, at a CAGR of 22.0%.”
- MarketsAndMarkets.com
Leaders in Data Governance - The Forrester Wave
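The quoted growth figures can be sanity-checked with the standard compound annual growth rate formula. This is a small illustrative sketch; the function names are assumptions, and the inputs are the rounded figures from the quote, so the result only needs to land close to the stated 22%.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by a start value and an end value."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value: float, rate: float, years: int) -> float:
    """Value after compounding a yearly growth rate for a number of years."""
    return start_value * (1 + rate) ** years


years = 2023 - 2018
cagr = implied_cagr(1.31, 3.53, years)   # figures in USD billions
projected = project(1.31, 0.22, years)

print(f"implied CAGR: {cagr:.1%}")
print(f"2023 value at 22% per year: ${projected:.2f} billion")
```

Both numbers come out close to the quoted values, which suggests the forecast is internally consistent.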
The Data World Distilled
[Diagram: data moving from Source Data through the Big Data lake, Data Warehouse and Data Mart to BI & Analytics, with ETL steps in between]
Source types
• Flat files
• Excel
• JSON
• XML
• Web services
• Databases
ETL Vendors
• Ab Initio
• IBM
• Informatica
• Microsoft
• Oracle
• SAP
• SAS
• Talend
Hadoop Vendors
• Amazon
• Cloudera
• Hortonworks
• IBM
• MAPR
• Microsoft
NoSQL Vendors
• Amazon
• Apache
• Cassandra
• Couchbase
• MongoDB
• Oracle
Data Warehouse Vendors
• Amazon
• IBM
• Microsoft
• Micro Focus
• Oracle
• SAP
• Snowflake
• Teradata
BI Vendors
• IBM
• Microsoft
• MicroStrategy
• Qlik
• Tableau
• SAP
• Oracle
Data Quality
• Informatica
• IBM
• Oracle
• SAP
• SAS
• Talend
Data Testing
• QuerySurge
• Informatica
• Tricentis
• Data Gaps
• IceDQ
• Bitwise
Data Governance
• Collibra
• DATUM
• GDE
• IBM
• Informatica
• SAP
the Data World by Top Vendors
The Data World Distilled:
Understanding how the data world works in the Big Data era
Any questions?
Bill Hayduk
Founder, CEO
QuerySurge™

Case study: Open Source Automation Framework using Selenium WebDriverCase study: Open Source Automation Framework using Selenium WebDriver
Case study: Open Source Automation Framework using Selenium WebDriver
 
Enterprise Business Intelligence & Data Warehousing: The Data Quality Conundrum
Enterprise Business Intelligence & Data Warehousing: The Data Quality ConundrumEnterprise Business Intelligence & Data Warehousing: The Data Quality Conundrum
Enterprise Business Intelligence & Data Warehousing: The Data Quality Conundrum
 
RTTS - the Software Quality Experts
RTTS - the Software Quality ExpertsRTTS - the Software Quality Experts
RTTS - the Software Quality Experts
 

Kürzlich hochgeladen

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Kürzlich hochgeladen (20)

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 

The Data World Distilled

  • 1. The Data World Distilled: Understanding how the data world works in the Big Data era. Bill Hayduk, Founder, CEO, QuerySurge™ (a software division of RTTS)
  • 2. About QuerySurge™. Facts: RTTS founded: 1996. Location: New York, NY (headquarters). Customer profile: Fortune 1000. Software offering: QuerySurge (2012). QuerySurge partners: • 11 industry-leading Technology Partners • 14 global System Integrators • 22 regional consulting firms. RTTS is the parent company of QuerySurge and began as a consulting firm centered on QA & testing. Partner categories shown: Technology Partners, System Integrators, Sales & Consulting Partners
  • 3. DWH, BI, Big Data Marketplaces. Data Warehouse Marketplace: “The worldwide data warehouse management software market is forecast to generate nearly $17 billion in revenue by 2019” - Forrester. Top vendors: Oracle, Teradata, IBM, Microsoft, SAP, Micro Focus and Amazon. Business Intelligence Marketplace: “The business intelligence (BI) and analytics software market is forecast to grow to $22.8 billion by the end of 2020” - Gartner. Top vendors: SAP, IBM, SAS, Microsoft, Oracle, Tableau, Qlik, MicroStrategy, Information Builders. Big Data Marketplace: “By the end of 2020, companies will spend > USD $72 billion on Big Data hardware, software, & professional services” - IDC. Top vendors: Oracle, IBM, Microsoft, Amazon, Micro Focus, Hortonworks, Cloudera, Teradata, SAP, MongoDB, MapR, DataStax, Snowflake.
  • 4. Fast Facts about Data • By the end of 2020, companies will spend > USD $72 billion on Big Data hardware, software, & professional services (the current market size is USD $46 billion) • > 75% of companies are investing or planning to invest in Big Data in the next 2 years • Professional services represent 43% of the Big Data market (services = USD $31 billion of $72 billion)
  • 5. The Data World Distilled: Big Data, Data Warehouse, ETL / Data Integration, BI & Analytics, Data Quality, Data Testing, Data Governance
  • 6. The Data World Distilled: Big Data, Data Warehouse, ETL / Data Integration, BI & Analytics, Data Quality, Data Testing, Data Governance
  • 7. What is Big Data?
  • 8. What is Big Data? Big Data is defined as too much volume, velocity and variability to work on normal database architectures. “The market for big data is $70 billion and growing by 15% a year.” - EMC COO Pat Gelsinger. Size: defined as 5 petabytes or more. 1 petabyte = 1,000 terabytes = 1,000,000 gigabytes = 1,000,000,000 megabytes
  • 9. The Big Data Impact. Handles more than 1 million customer transactions every hour: • data imported into databases that contain > 2.5 petabytes of data • the equivalent of 167 times the information contained in all the books in the US Library of Congress. Others: Facebook handles 40 billion photos from its user base. Google processes 1 terabyte per hour. Twitter processes 85 million tweets per day. eBay processes 80 terabytes per day.
  • 10. What is Hadoop? Hadoop is an open source project that develops software for scalable, distributed computing. Hadoop: • easily deals with the complexities of high volume, velocity and variability of data • is a framework for the distributed processing of large data sets across clusters of computers using simple programming models • scales up from single servers to 1,000s of machines, each offering local computation and storage • detects and handles failures at the application layer
  • 11. Key Attributes of Hadoop • Redundant and reliable • Extremely powerful • Easy to program distributed apps • Runs on commodity hardware
  • 12. Top Vendors. “By the end of 2020, companies will spend more than USD $72 billion on Big Data hardware, software, & professional services” - IDC
  • 13. Basic Hadoop Architecture. Each machine has two parts. MapReduce: the processing part that manages the programming jobs (a.k.a. Task Tracker). HDFS (Hadoop Distributed File System): stores data on the machines (a.k.a. Data Node).
  • 14. Basic Hadoop Architecture (continued). Cluster: add more machines for scaling, from 1 to 100 to 1,000; each machine in the cluster runs a Task Tracker and a Data Node. Job Tracker: accepts jobs, assigns tasks, identifies failed machines. Name Node: coordination for HDFS; inserts and extractions are communicated through the Name Node.
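The division of labor on the two Hadoop architecture slides can be illustrated with a single-process sketch. This is not Hadoop's API: the function names (`map_task`, `shuffle`, `reduce_task`, `run_job`) and the sample blocks are hypothetical, and the list of strings only stands in for data blocks that HDFS would hold on separate Data Nodes. It assumes the classic word-count job as the workload.

```python
from collections import defaultdict

def map_task(block):
    """Map task (what a Task Tracker would run): emit (word, 1) per word in one block."""
    return [(word.lower(), 1) for word in block.split()]

def shuffle(map_outputs):
    """Group the emitted values by key across the outputs of all map tasks."""
    grouped = defaultdict(list)
    for pairs in map_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_task(grouped):
    """Reduce task: collapse each key's list of values into one total."""
    return {key: sum(values) for key, values in grouped.items()}

def run_job(blocks):
    """Stand-in for the Job Tracker: one map task per block, then shuffle and reduce."""
    map_outputs = [map_task(block) for block in blocks]
    return reduce_task(shuffle(map_outputs))

# Each string plays the role of a block stored on a different Data Node.
blocks = ["big data big", "data warehouse", "big warehouse"]
counts = run_job(blocks)
print(counts)
```

In a real cluster the map tasks run in parallel on the machines that already hold the blocks, which is why adding machines adds both storage and processing capacity.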
  • 15. Apache Hive: a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Hive provides a mechanism to query the data using a SQL-like language called HiveQL that interacts with the HDFS files: • create • insert • update • delete • select. Stack shown: HiveQL on top of MapReduce (Task Tracker) and HDFS (Data Node).
  • 16. What is NoSQL? A term used to describe high-performance, non-relational databases that provide a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. NoSQL database types: Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents. Graph stores are used to store information about networks of data, such as social connections. Graph stores include Neo4J and Giraph. Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or 'key'), together with its value. Examples of key-value stores are Riak and Berkeley DB. Some key-value stores, such as Redis, allow each value to have a type, such as 'integer', which adds functionality. Wide-column stores such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together, instead of rows.
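The four NoSQL data models described on this slide can be mimicked with plain Python structures. This is only a shape comparison under assumed sample data; none of the names or records below come from the deck, and no real database client is used.

```python
# Document database: each key is paired with a nested "document".
document_store = {
    "customer:42": {
        "name": "Ada",
        "address": {"city": "New York"},      # nested document
        "orders": [{"sku": "A1", "qty": 2}],  # key-array pair
    }
}

# Key-value store: an attribute name (key) stored together with its value.
key_value_store = {"visits:42": 17, "session:9f3": "customer:42"}

# Wide-column store: the values of one column are kept together, not row by row.
wide_column_store = {
    "name": {"row1": "Ada", "row2": "Bob"},
    "city": {"row1": "New York", "row2": "Boston"},
}

# Graph store: nodes and their connections, here as adjacency sets.
graph_store = {"Ada": {"Bob"}, "Bob": {"Ada", "Cy"}, "Cy": {"Bob"}}

def friends_of_friends(graph, person):
    """People two hops away who are not already direct connections."""
    direct = graph[person]
    two_hops = set().union(*(graph[friend] for friend in direct))
    return two_hops - direct - {person}

city = document_store["customer:42"]["address"]["city"]   # walk a nested document
all_cities = sorted(wide_column_store["city"].values())   # scan a single column
suggestions = friends_of_friends(graph_store, "Ada")       # traverse connections
```

Scanning one column touches only that column's values, which is the property the slide attributes to wide-column stores for queries over large datasets.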
  • 18. NoSQL versus Hadoop. When to use NoSQL? • Online real-time processing • Data set is smaller • Measured in milliseconds. When to use Hadoop? • Offline big data processing • Offline analytics • Measured in minutes & hours. Source: classpattern.com
  • 19. NoSQL Example: Use Cases. Data Warehouse, Batch Aggregation, ETL from MongoDB, ETL to MongoDB. Source: MongoDB, Inc.
  • 20. The Data World Distilled: Big Data, Data Warehouse, ETL / Data Integration, BI & Analytics, Data Quality, Data Testing, Data Governance
  • 21. What is a Data Warehouse?
  • 22. What is a Data Warehouse? A data warehouse: • is typically a relational database that is designed for query and analysis rather than for transaction processing • is a place where historical data is stored for archival, analysis and security purposes • contains either raw data or formatted data • combines data from multiple sources: sales, salaries, operational data, human resource data, inventory data, web logs, social networks, Internet text and docs, other. Source systems shown: Legacy DB, CRM/ERP DB, Finance DB
  • 23. Data Warehouse - the marketplace. “The worldwide data warehouse management software market is forecast to generate nearly $17 billion in revenue by 2019” - Forrester. Data warehouse size: small data warehouses: < 5 TB; midsize data warehouses: 5 TB - 20 TB; large data warehouses: > 20 TB - Analyst firm Gartner. Leaders in on-premises Data Warehouse Data Management Systems - Analyst firm Gartner’s ‘Magic Quadrant for Data Warehouse Database Management Systems’
  • 24. Data Warehouse - the marketplace: Alternate Delivery Models. Leading Cloud DWHs. Leading Appliance DWHs: an appliance is software and servers optimized together. Photo caption: Oracle founder Larry Ellison with an Exadata appliance.
  • 25. Data Warehouse - Business Case. Why build a Data Warehouse? • Data stored in operational systems (OLTP) is not easily accessible • OLTP systems are not designed for end-user analysis • The data in OLTP is constantly changing • OLTP systems may be deficient in historical data • Diverse forms of data are stored in different platforms and/or dissimilar formats
  • 26. Data Warehouse - Business Case. The Data Warehouse Business Solution • Collects data from different sources (other databases, files, web services, etc.) • Integrates data into logical business areas • Provides direct access to data with powerful reporting tools (BI)
  • 27. Data Warehouse - about the data. The Data Warehouse data is: • Subject-oriented • Integrated • Non-volatile • Time-variant
  • 28. The Data World Distilled: Big Data, Data Warehouse, ETL / Data Integration, BI & Analytics, Data Quality, Data Testing, Data Governance
  • 29. Data Integration & the ETL process. ETL = Extract, Transform, Load. Why ETL? The data warehouse needs to be loaded regularly (daily/weekly) so that it can serve its purpose of facilitating business analysis. Extract: pull data from one or more OLTP systems and copy it into the warehouse. Transform: remove inconsistencies, assemble to a common format, add missing fields, summarize detailed data and derive new fields to store calculated data. Load: map the data, transform and/or load it into the DWH. The ETL function is performed either by home-grown software or by commercial software.
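The three ETL steps described on this slide can be sketched with Python's built-in `sqlite3` module. Everything specific here is an assumption for illustration: the `orders` and `fact_orders` tables, their columns, the default currency of USD, and the sample rows. Two in-memory databases stand in for an OLTP source and the target warehouse.

```python
import sqlite3

def extract(source):
    """Extract: read the rows out of the (hypothetical) OLTP orders table."""
    return source.execute(
        "SELECT id, customer, amount, currency FROM orders"
    ).fetchall()

def transform(rows):
    """Transform: remove inconsistencies, fill a missing field, derive a new field."""
    cleaned = []
    for order_id, customer, amount, currency in rows:
        name = customer.strip().title()         # common format for names
        currency = (currency or "USD").upper()  # assumed default when missing
        amount_cents = round(amount * 100)      # derived, calculated field
        cleaned.append((order_id, name, currency, amount_cents))
    return cleaned

def load(target, rows):
    """Load: write the transformed rows into the warehouse table."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders ("
        "id INTEGER PRIMARY KEY, customer TEXT, currency TEXT, amount_cents INTEGER)"
    )
    target.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", rows)
    target.commit()

# Source system with inconsistent formatting and a missing currency.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL, currency TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "  ada lovelace ", 19.99, "usd"), (2, "BOB SMITH", 5.0, None)],
)

target = sqlite3.connect(":memory:")
load(target, transform(extract(source)))
loaded = target.execute("SELECT * FROM fact_orders ORDER BY id").fetchall()
```

Keeping the three steps as separate functions mirrors how a tester would later check each one: row counts after extract, rule-by-rule checks after transform, and a source-to-target comparison after load.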
  • 30. The ETL process. Source Data (Legacy DB, CRM/ERP DB, Finance DB) → ETL Process (Extract, Transform, Load) → Target DWH
  • 31. Continuous Integration/ETL solutions - the Marketplace. Leaders in ETL Solutions (Ab Initio)
  • 32. The Data World Distilled: Big Data, Data Warehouse, ETL / Data Integration, BI & Analytics, Data Quality, Data Testing, Data Governance
  • 33. Business Intelligence (BI)
  • 34. Business Intelligence (BI): What is it? • Software applications used in spotting, digging out, and analyzing business data • BI provides simple access to data which can be used in day-to-day operations, and integrates data into logical business areas • BI provides historical, current and predictive views of business operations • BI is made up of several related activities, including data mining, online analytical processing, querying and reporting. Business Intelligence software is like reporting engines on steroids.
  • 35. BI & Analytics - the marketplace. “The business intelligence (BI) and analytics software market is forecast to grow to $22.8 billion by the end of 2020.” “The four large "stack" vendors (SAP, Oracle, IBM and Microsoft) continue to consolidate the market, owning 59 percent of the market share.” - Analyst firm Gartner. Leaders in BI - Analyst firm Forrester Research’s ‘Forrester Wave’
  • 36. Business Intelligence (BI) - Who uses it? Wal-Mart uses vast amounts of data and category analysis to dominate the industry. Amazon and Yahoo follow a "test and learn" approach to business changes. Hardee’s, Wendy’s, and T.G.I. Friday’s use BI to make strategic decisions.
  • 37. Business Intelligence (BI) & Data Marts. Data Mart: a database that has the same characteristics as a data warehouse, but is usually smaller and is focused on the data for one division or one workgroup within an enterprise. It typically holds aggregated data and some granular data. It is a subset of the DWH and makes Business Intelligence reporting more efficient. BI tools sit on top of the data marts. Flow shown: Source Data (Legacy DB, CRM/ERP DB, Finance DB) → ETL Process → Target DWH → ETL Process → Data Mart
  • 38. Business Intelligence (BI) & Analytics. Flow shown: Source Data (Legacy DB, CRM/ERP DB, Finance DB) → ETL Process → Target DWH → ETL Process → Data Mart
  • 39. The Data World Distilled: Big Data, Data Warehouse, ETL / Data Integration, BI & Analytics, Data Quality, Data Testing, Data Governance
  • 40. Data Quality Issues. Data Quality Best Practices boost revenue by 66%. 46% of companies cite Data Quality as a barrier for adopting Business Intelligence products. 80% of organizations… will underestimate the costs related to the data acquisition tasks by an average of 50 percent. The average organization loses $14.2 million annually through poor Data Quality.
  • 41. Data Quality. Primary characteristics of Data Quality tools (courtesy of Gartner’s “Magic Quadrant for Data Quality Tools”): • Profiling • Parsing and standardization • Generalized cleansing • Matching • Monitoring • Enrichment • Subject-area-specific support • Metadata management • Configuration environment
  • 42. Data Quality - the marketplace. “The market for data quality software tools reached $1.61 billion in 2017 (the most recent year for which Gartner has data), an increase of 11.6% over 2016. Gartner’s interactions with clients also indicate that demand remains high.” - Analyst firm Gartner. Leaders in Data Quality - Analyst firm Gartner’s Magic Quadrant
  • 43. The Data World Distilled: Big Data, Data Warehouse, ETL / Data Integration, BI & Analytics, Data Quality, Data Testing, Data Governance
  • 44. Data Quality vs. Data Testing. Primary characteristics of Data Quality tools (courtesy of Gartner’s “Magic Quadrant for Data Quality Tools”): • Profiling • Parsing and standardization • Generalized cleansing • Matching • Monitoring • Enrichment • Subject-area-specific support • Metadata management • Configuration environment. Primary characteristics of Data Testing tools (courtesy of the book "Testing the Data Warehouse Practicum"): ▪ Data Completeness ▪ Data Transformation ▪ Regression Testing ▪ Reporting. Question posed under each column: Data Verification & Validation?
  • 45. Where Data Testing fits in your data strategy
  • 46. The Executive Office and Critical Data. CxOs are using Business Intelligence & Analytics to make critical business decisions, with the assumption that the underlying data is fine. “The average organization loses $14.2 million annually through poor Data Quality.” - Gartner. Diagram labels: Business Intelligence & Analytics; Data Architecture; typical data issue areas; ETL; Mainframe
  • 47. Key Roles in Building & Testing a Data Store. Data Analyst: creates data requirements (source-to-target map or mapping doc). Data Architect: models and builds the data store (Big Data lake, Data Warehouse, etc.). ETL Developer: transforms and loads data from sources to target data stores. Data Tester: validates the data, based on mappings, as it moves and transforms from sources to targets.
  • 48. Data Requirements = Mapping Document (a.k.a. Source-to-Target Map). It is the critical element required to efficiently plan the target data stores. It also defines the Extract, Transform, Load (ETL) process. Intention: ✓ capture business rules ✓ data flow mapping ✓ data movement requirements. The mapping doc specifies: ▪ Source input definition ▪ Target/output details ▪ Business & data transformation rules ▪ Absolute data quality requirements ▪ Optional data quality requirements
  • 49. Most Common Data Validation Method: Sampling • Review business rules (i.e. mapping document, data flow mappings) • Write tests in a SQL editor • Execute 2 tests: 1 at source & 1 at target • Export results to 2 Excel files • Compare a sampling of results by eye (‘Stare & Compare’). Issue with Stare & Compare: it is impossible to visually compare billions of data sets. Result: usually less than 1% of data is compared. Example - current QuerySurge customer: • one test = 100 million rows × 200 columns = 20 billion data sets • there is no practical way to manually verify (eyeball) this data set • the client has more than 15,000 total tests
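The alternative to sampling by eye is to compare every row programmatically. The sketch below is a generic illustration using Python's built-in `sqlite3` module, not a description of how QuerySurge works internally; the `customers` table, its columns, the sample rows, and the helper names `fetch_keyed` and `compare` are all hypothetical. It also reproduces the slide's arithmetic for the size of one test.

```python
import sqlite3

def fetch_keyed(conn, query):
    """Run one test query and index the result rows by their first column (the key)."""
    return {row[0]: row[1:] for row in conn.execute(query)}

def compare(source_rows, target_rows):
    """Compare every row instead of a sample and report all differences by key."""
    shared = source_rows.keys() & target_rows.keys()
    return {
        "missing_in_target": sorted(source_rows.keys() - target_rows.keys()),
        "extra_in_target": sorted(target_rows.keys() - source_rows.keys()),
        "mismatched": sorted(k for k in shared if source_rows[k] != target_rows[k]),
    }

def make_db(rows):
    """Build an in-memory database holding the hypothetical customers table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, balance INTEGER)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    return conn

source = make_db([(1, "Ada", 100), (2, "Bob", 200), (3, "Cy", 300)])
target = make_db([(1, "Ada", 100), (2, "Bob", 250), (4, "Dee", 400)])

query = "SELECT id, name, balance FROM customers"
report = compare(fetch_keyed(source, query), fetch_keyed(target, query))
passed = not any(report.values())

# Size of the single test quoted on the slide: rows times columns.
values_in_one_test = 100_000_000 * 200
```

Here the report flags key 3 as missing, key 4 as extra, and key 2 as mismatched, so the test fails. At the scale quoted on the slide the comparison would have to be streamed or pushed down to the databases rather than held in memory as in this sketch.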
  • 50. Data Store Roles, Tasks, & Timelines. Data Analyst: determine requirements; create & maintain Mapping Document (iterate). Data Architect: review Mapping Document; model and build target data stores; maintain target data stores. ETL Developer: review Mapping Document; build data movement logic; extract & load data or extract, transform, & load data (iterate). Data Tester: review Mapping Document; create 2 SQL tests for each mapping with a SQL editor; execute tests; dump results of tests to 2 Excel files; compare Excel files by eye. Callout: Huge Risk
  • 51. About QuerySurge
  • 52. What is QuerySurge? QuerySurge is the leading testing solution for automated validation & testing of Big Data. QuerySurge Use Cases
  • 53. How QuerySurge Works. QuerySurge connects to any 2 points at one time. Queries (SQL, HQL) run against Source Data and Target Data, followed by a comparison of every data set. Results: pass/fail, Data Intelligence Reports, Data Health Dashboard, automated email reports. Connection points shown: Big Data stores (Hadoop, NoSQL); data warehouses; XML; web services; data stores (databases, data warehouses, data marts); flat files (fixed width, delimited, Excel, JSON); Business Intelligence reports
  • 54. Big Data Process - Developer & Tester. ETL Developer: codes data movement based on Mapping Requirements. Data Tester: tests data movement based on Mapping Requirements. BI Analyst: extracts data for reports. Tester: tests BI reports. Stages shown: Source Data, Big Data lake, Data Warehouse, Data Mart, BI & Analytics, connected by ETL. Testing Points #1 through #4 are marked along the flow.
  • 55. QuerySurge Supports 50+ Data Stores. QuerySurge supports the following data stores… • Amazon Redshift, Elastic Map Reduce, DynamoDB • Apache Hadoop/Hive, Spark • Cassandra • Cloudera • Couchbase • Exasol • Flat Files (delimited, fixed-width) • Hortonworks • IBM (Db2, Netezza, Informix, Big Insights, Cloudant, MDM, Cognos) • JSON files • Mainframe • MapR • Micro Focus Vertica • Microsoft (SQL Server DWH, HDInsight, PDW, SSAS, Excel, Access, SharePoint) • MongoDB • Oracle (Oracle DB, MySQL, Exadata, NoSQL, Hadoop) • Pivotal GreenPlum • PostgreSQL • Salesforce • SAP (HANA, IQ, ASE, SQL Anywhere, Altiscale Data Cloud) • Snowflake • Tableau • Teradata, Aster • Workday • XML …and any other data store
  • 56. The Data World Distilled: Big Data, Data Warehouse, ETL / Data Integration, BI & Analytics, Data Quality, Data Testing, Data Governance
  • 57. 4 main components of successful data governance. 1) Data stewardship: identifying and assigning roles and responsibilities: who is creating the data, who has overall responsibility for the data, who uses the data, who routes it, who oversees its use. 2) Data classification: identify and categorize data types into groups. 3) Data quality: the process of measuring the reliability of current data sets to provide information that can be used to make organizational decisions. 4) Data management: the process where all the organization's data governance efforts come together. The company actively manages its data governance efforts, which involves the creation of the architectures and business processes required to properly maintain the organization’s data through its full lifecycle.
  • 58. Data Maturity Model - Process. Source: IBM Data Governance Council Maturity Model • Patterned after the Capability Maturity Model Integration (CMMI) from the Software Engineering Institute (SEI) at Carnegie Mellon University • Devised by IBM, along with 55 other companies. Characteristics listed across the maturity levels: • Few stable processes exist • “Just do it” mentality • Data-related policies become more clear & reflect the organization’s data principles • Data integration opportunities are better leveraged • Risk assessment for data integrity & quality becomes part of the organization’s project methodology • Further defined value of data for more data elements • Data Governance methodology is introduced during the planning stages of new projects • Enterprise data models are documented & published • Data Governance is second nature • ROI for data-related projects is tracked • Business value of data mgmt is recognized • Cost of data mgmt is easier to manage • Costs are reduced as processes become automated • More data-related controls are documented • Metadata becomes an important part of documenting critical data elements
  • 59. Data Governance - the marketplace. “Rapidly increasing growth in data volumes, rising regulatory & compliance mandates, and enhancing strategic risk management & decision-making are expected to drive the growth of the data governance market. The data governance market size is expected to grow from $1.31 billion in 2018 to $3.53 billion by 2023, at a CAGR of 22.0%.” - MarketsAndMarkets.com. Leaders in Data Governance - The Forrester Wave
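The growth figures quoted on this slide can be cross-checked with the standard compound annual growth rate formula, CAGR = (end / start)^(1 / years) - 1. The short sketch below uses only the numbers from the quote; the variable names are illustrative, and the five-year span assumes growth is counted from 2018 to 2023.

```python
start_billion = 1.31   # market size in 2018, from the quote
end_billion = 3.53     # forecast market size in 2023, from the quote
years = 2023 - 2018    # assumed compounding periods

# Rate implied by the two endpoints.
implied_cagr = (end_billion / start_billion) ** (1 / years) - 1

# Size reached by compounding the quoted 22.0% rate from the 2018 figure.
projected_billion = start_billion * (1 + 0.22) ** years

print(f"implied CAGR: {implied_cagr:.1%}")
print(f"projected 2023 size at 22%: ${projected_billion:.2f} billion")
```

Both directions agree with the quote to within rounding: the endpoints imply a rate just under 22%, and compounding 22% lands within about a hundredth of a billion of the quoted 2023 figure.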
  • 60. The Data World Distilled
  • 61. The Data World by Top Vendors. Stages shown: Source Data, Big Data lake, Data Warehouse, Data Mart, BI & Analytics, connected by ETL. Source types: flat files, Excel, JSON, XML, web services, databases. ETL vendors: Ab Initio, IBM, Informatica, Microsoft, Oracle, SAP, SAS, Talend. Hadoop vendors: Amazon, Cloudera, Hortonworks, IBM, MapR, Microsoft. NoSQL vendors: Amazon, Apache, Cassandra, Couchbase, MongoDB, Oracle. Data Warehouse vendors: Amazon, IBM, Microsoft, Micro Focus, Oracle, SAP, Snowflake, Teradata. BI vendors: IBM, Microsoft, MicroStrategy, Qlik, Tableau, SAP, Oracle. Data Quality: Informatica, IBM, Oracle, SAP, SAS, Talend. Data Testing: QuerySurge, Informatica, Tricentis, Data Gaps, IceDQ, Bitwise. Data Governance: Collibra, DATUM, GDE, IBM, Informatica, SAP
  • 62. The Data World Distilled: Understanding how the data world works in the Big Data era. Any questions? Bill Hayduk, Founder, CEO, QuerySurge™