Bill Hayduk is the founder and CEO of QuerySurge, a software division of RTTS. RTTS, headquartered in New York, was founded in 1996 as a consulting firm centered on QA and testing; it released QuerySurge in 2012 and serves Fortune 1000 customers through partnerships with technology companies, system integrators, and consulting firms. The deck surveys the data and analytics marketplace and gives an overview of concepts like data warehousing, ETL, BI, data quality, data testing, big data, Hadoop, and NoSQL.
1. Bill Hayduk
Founder, CEO
QuerySurge™, a software division of RTTS
The Data World Distilled:
Understanding how the data world works in the Big Data era
2. QuerySurge™
About
FACTS
RTTS founded: 1996
Location: New York, NY (headquarters)
Customer profile: Fortune 1000
Software offering: QuerySurge (2012)
QuerySurge partners:
• 11 industry-leading technology partners
• 14 global system integrators
• 22 regional consulting firms
RTTS is the parent company of QuerySurge and began as a consulting firm centered on QA & testing.
3. Data Warehouse Marketplace
“the worldwide data warehouse management software market is forecast
to generate nearly $17 billion in revenue by 2019” - Forrester
Top vendors: Oracle, Teradata, IBM, Microsoft, SAP, Micro Focus and Amazon
Business Intelligence Marketplace
“The business intelligence (BI) and analytics software market is forecast to grow to
$22.8 billion by the end of 2020” - Gartner
SAP, IBM, SAS, Microsoft, Oracle, Tableau, Qlik, MicroStrategy , Information Builders
DWH, BI, Big Data Marketplaces
Big Data Marketplace
“By the end of 2020, companies will spend > USD $72 billion on Big Data
hardware, software, & professional services” - IDC
Oracle, IBM, Microsoft, Amazon, Micro Focus, HortonWorks, Cloudera, Teradata,
SAP, MongoDB, MapR, DataStax, Snowflake.
4. Fast Facts about Data
• By the end of 2020, companies will spend > USD $72 billion
on Big Data hardware, software, & professional services
(the current market size is USD $46 billion)
• > 75% of companies are investing or planning to invest in
Big Data in the next 2 years
• Professional services represents 43% of the Big Data market
(services=USD $31 Billion of $72 Billion)
7. What is Big Data?
8. Big Data: defined as too much volume, velocity and variability to work on normal
database architectures.
“The market for big data is $70 billion and growing by
15% a year.”
- EMC COO Pat Gelsinger
Size
Defined as 5 petabytes or more
1 petabyte = 1,000 terabytes
1,000 terabytes = 1,000,000 gigabytes
1,000,000 gigabytes = 1,000,000,000 megabytes
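The unit ladder above can be checked with a few lines of arithmetic. This is only an illustrative sketch: it assumes the decimal (power-of-1,000) units the slide uses, and the constant and function names are made up for the example.

```python
# Decimal storage units, as used on the slide (1 PB = 1,000 TB, and so on).
TB_PER_PB = 1_000
GB_PER_TB = 1_000
MB_PER_GB = 1_000


def petabytes_to(unit: str, petabytes: float) -> float:
    """Convert petabytes to 'TB', 'GB', or 'MB' using decimal units."""
    factors = {
        "TB": TB_PER_PB,
        "GB": TB_PER_PB * GB_PER_TB,
        "MB": TB_PER_PB * GB_PER_TB * MB_PER_GB,
    }
    return petabytes * factors[unit]


# The slide's 5-petabyte threshold, expressed in smaller units.
BIG_DATA_THRESHOLD_PB = 5
print(petabytes_to("TB", BIG_DATA_THRESHOLD_PB))
print(petabytes_to("GB", BIG_DATA_THRESHOLD_PB))
```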
What is Big Data?
9. Wal-Mart handles more than 1 million customer transactions every hour.
• data imported into databases that contain > 2.5 petabytes of data
• the equivalent of 167 times the information contained in all the books in the US Library of
Congress.
Facebook handles 40 billion photos from its user base.
Google processes 1 Terabyte per hour
Twitter processes 85 million tweets per day
eBay processes 80 Terabytes per day
the Big Data Impact
10. What is Hadoop?
Hadoop is an open source project that develops software for scalable,
distributed computing.
• easily deals with complexities of high volume, velocity and variability of data
• is a framework for the distributed processing of large data sets across
clusters of computers using simple programming models
• scales up from single servers to 1,000’s of machines, each offering local
computation and storage
• detects and handles failures at the application layer
11. • Redundant and reliable
• Extremely powerful
• Easy to program distributed apps
• Runs on commodity hardware
Key Attributes of Hadoop
12. Top Vendors
“By the end of 2020, companies will spend more than USD $72 billion on
Big Data hardware, software, & professional services” - IDC
13. MapReduce – the processing part that manages the programming jobs (a.k.a. Task Tracker).
HDFS (Hadoop Distributed File System) – stores data on the machines (a.k.a. Data Node).
[Diagram: one machine running MapReduce (Task Tracker) and HDFS (Data Node)]
Basic Hadoop Architecture
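The processing half of this architecture follows the map, shuffle, reduce pattern. The sketch below imitates that pattern in plain, single-process Python with a word count; it does not use Hadoop, and every function name in it is hypothetical. In a real cluster the map and reduce steps would run as tasks on many machines.

```python
from collections import defaultdict
from typing import Iterable


def map_phase(line: str) -> list[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle step: group all emitted values by key."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: list[str]) -> dict[str, int]:
    # Each line could be mapped on a different machine; here it is one loop.
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data nodes store data"]))
```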
14. Cluster
Add more machines for scaling – from 1 to 100 to 1,000
Job Tracker: accepts jobs, assigns tasks, identifies failed machines
Name Node: coordination for HDFS. Inserts and extractions are communicated through the Name Node.
[Diagram: a cluster of machines, each running a Task Tracker and a Data Node, together with a Name Node]
Basic Hadoop Architecture (continued)
15. [Diagram: HiveQL on top of MapReduce (Task Tracker) and HDFS (Data Node)]
Apache Hive - a data warehouse infrastructure built on top of Hadoop for
providing data summarization, query, and analysis.
Hive provides a mechanism to query the data using a SQL-like language
called HiveQL that interacts with the HDFS files
• create
• insert
• update
• delete
• select
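The five statement types listed above look much like ordinary SQL. The sketch below runs them with Python's built-in sqlite3 module purely as a stand-in, since Hive is not available in a plain Python install. The statements are standard SQL rather than verified HiveQL, the table and column names are hypothetical, and Hive's own syntax and its support for update and delete depend on the Hive version and table configuration.

```python
import sqlite3

# sqlite3 is only a stand-in here; this is not a Hive connection.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# create
cur.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
# insert
cur.executemany(
    "INSERT INTO page_views (page, views) VALUES (?, ?)",
    [("home", 120), ("pricing", 45), ("blog", 80)],
)
# update
cur.execute("UPDATE page_views SET views = views + 5 WHERE page = 'pricing'")
# delete
cur.execute("DELETE FROM page_views WHERE page = 'blog'")
# select
cur.execute("SELECT page, views FROM page_views ORDER BY page")
rows = cur.fetchall()
print(rows)
conn.close()
```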
Apache Hive
16. What is NoSQL?
A term used to describe high-performance, non-relational databases that provide a mechanism for storage and
retrieval of data that is modeled in means other than the tabular relations used in relational databases
NoSQL Database Types
Document databases pair each key with a complex data structure known as a document. Documents can contain
many different key-value pairs, or key-array pairs, or even nested documents.
Graph stores are used to store information about networks of data, such as social connections. Graph stores
include Neo4J and Giraph.
Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute
name (or 'key'), together with its value. Examples of key-value stores are Riak and Berkeley DB. Some key-value
stores, such as Redis, allow each value to have a type, such as 'integer', which adds functionality.
Wide-column stores such as Cassandra and HBase are optimized for queries over large datasets, and store
columns of data together, instead of rows.
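To make the four database types more concrete, the sketch below models each one with plain Python dictionaries and sets. It is an analogy only, not how any of the named products store data, and all keys, names, and helper functions are invented for illustration.

```python
# Key-value store: one value per key.
key_value = {"session:42": "logged_in"}

# Document store: each key maps to a nested, self-describing document.
documents = {
    "customer:1": {
        "name": "Ada",
        "emails": ["ada@example.com"],
        "address": {"city": "New York", "state": "NY"},
    }
}

# Wide-column store: values for one column are kept together, not row by row.
wide_column = {
    "name": {"row1": "Ada", "row2": "Grace"},
    "city": {"row1": "New York", "row2": "Boston"},
}

# Graph store: nodes plus the connections between them.
graph = {"Ada": {"Grace"}, "Grace": {"Ada", "Linus"}, "Linus": {"Grace"}}


def column_values(store: dict[str, dict[str, str]], column: str) -> list[str]:
    """Read a whole column without touching the other columns."""
    return list(store[column].values())


def friends_of_friends(g: dict[str, set[str]], person: str) -> set[str]:
    """People two hops away, excluding the person and direct connections."""
    direct = g[person]
    two_hops = set().union(*(g[f] for f in direct)) if direct else set()
    return two_hops - direct - {person}


print(column_values(wide_column, "city"))
print(friends_of_friends(graph, "Ada"))
```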
About NoSQL
18. NoSQL versus Hadoop
When to use NoSQL?
• Online real-time processing
• Data set is smaller
• Measured in milliseconds
When to use Hadoop?
• Offline big data processing
• Offline analytics
• Measured in minutes & hours
Source: classpattern.com
22. Data Warehouse
• typically a relational database that is designed for query and analysis
rather than for transaction processing
• a place where historical data is stored for archival, analysis and
security purposes.
• contains either raw data or formatted data
• combines data from multiple sources
• Sales
• salaries
• operational data
• human resource data
• inventory data
• web logs
• Social networks
• Internet text and docs
• other
[Diagram: source databases – Legacy DB, CRM/ERP DB, Finance DB]
What is a Data Warehouse?
23. “The worldwide data warehouse management software market is
forecast to generate nearly $17 billion in revenue by 2019”
- Forrester
Data Warehouse size
Small data warehouses: < 5 TB
Midsize data warehouses: 5 TB - 20 TB
Large data warehouses: >20 TB
- Analyst firm Gartner
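The size bands quoted above translate directly into a small classifier. This is a minimal sketch; the function name is hypothetical, and treating exactly 5 TB and exactly 20 TB as midsize is an assumption taken from the "5 TB - 20 TB" wording.

```python
def classify_warehouse(size_tb: float) -> str:
    """Classify a data warehouse by size, using the thresholds quoted above."""
    if size_tb < 0:
        raise ValueError("size must be non-negative")
    if size_tb < 5:
        return "small"
    if size_tb <= 20:  # assumption: both boundaries count as midsize
        return "midsize"
    return "large"


for size in (2, 12, 40):
    print(size, "TB ->", classify_warehouse(size))
```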
Leaders in on-premises Data Warehouse Data Management Systems
- Analyst firm Gartner’s ‘Magic Quadrant for Data Warehouse Database Management Systems’
Data Warehouse - the marketplace
24. Alternate Delivery Models
Leading Cloud DWHs
Leading Appliance DWHs
An appliance is software and servers optimized together.
[Photo: Oracle founder Larry Ellison with an Exadata appliance]
Data Warehouse - the marketplace
25. Why build a Data Warehouse?
• Data stored in operational systems (OLTP) not
easily accessible
• OLTP systems are not designed for end-user
analysis
• The data in OLTP is constantly changing
• May be deficient in historical data
• Diverse forms of data stored in different platforms
and/or dissimilar formats
Data Warehouse - Business Case
26. The Data Warehouse Business Solution
• Collects data from different sources (other
databases, files, web services, etc)
• Integrates data into logical business areas
• Provides direct access to data with powerful
reporting tools (BI)
Data Warehouse - Business Case
27. The Data Warehouse data
• Subject-oriented
• Integrated
• Non-volatile
• Time-variant
Data Warehouse - about the data
29. ETL = Extract, Transform, Load
Why ETL?
Need to load the data warehouse regularly (daily/weekly) so that it can serve its
purpose of facilitating business analysis.
Data Integration & the ETL process
Extract – copy data from one or more OLTP systems into the warehouse.
Transform – remove inconsistencies, assemble to a common format, add
missing fields, summarize detailed data, and derive new fields to store
calculated data.
Load – map the data, then transform and/or load it into the DWH.
The ETL function is performed either by home-grown software or by
commercial software.
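A home-grown ETL job of the kind described above can be very small. The sketch below is one hypothetical shape for it, using an in-memory SQLite database as a stand-in warehouse; the sample rows, table name, and function names are all invented for illustration.

```python
import sqlite3

# Hypothetical rows extracted from an OLTP system: (customer, state, amount).
source_rows = [
    ("Acme", "ny", "100.50"),
    ("Acme", "NY", "20.00"),
    ("Globex", None, "75.25"),
]


def transform(rows):
    """Standardize formats, fill missing fields, and summarize per customer."""
    totals = {}
    for customer, state, amount in rows:
        state = (state or "UNKNOWN").upper()  # fix inconsistencies, fill gaps
        key = (customer, state)
        totals[key] = totals.get(key, 0.0) + float(amount)  # summarize detail
    return [(c, s, round(t, 2)) for (c, s), t in sorted(totals.items())]


def load(rows, conn):
    """Load the transformed rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_summary "
        "(customer TEXT, state TEXT, total REAL)"
    )
    conn.executemany("INSERT INTO sales_summary VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(transform(source_rows), warehouse)
loaded = warehouse.execute(
    "SELECT customer, state, total FROM sales_summary ORDER BY customer"
).fetchall()
print(loaded)
```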
34. Business Intelligence – What is it?
• Software applications used in spotting, digging-out, and
analyzing business data
• BI provides simple access to data which can be used in day
to day operations, integrates data into logical business areas
• BI provides historical, current and predictive views of
business operations
• BI is made up of several related activities, including data
mining, online analytical processing, querying and reporting.
Business Intelligence (BI)
Business Intelligence software is like reporting engines on steroids
35. “The business intelligence (BI) and analytics software market is forecast to
grow to $22.8 billion by the end of 2020”
“The four large "stack" vendors (SAP, Oracle, IBM and Microsoft) continue to
consolidate the market, owning 59 percent of the market share. ”
- Analyst firm Gartner
BI & Analytics - the marketplace
- Analyst firm Forrester Research’s ‘Forrester Wave’
Leaders in BI
36. Wal-Mart uses vast amounts of data and category analysis to
dominate the industry.
Amazon and Yahoo follow a "test and learn" approach to
business changes.
Hardee’s, Wendy’s, and T.G.I. Friday’s use BI
to make strategic decisions.
Business Intelligence (BI) - Who uses it?
37. Data Mart
A database that has the same characteristics as a data warehouse, but is
usually smaller and is focused on the data for one division or one
workgroup within an enterprise.
Data marts typically hold aggregated data and some granular data. A data mart
is a subset of the DWH that makes Business Intelligence reporting more
efficient. BI tools sit on top of the data marts.
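The idea of a data mart as an aggregated subset of the warehouse can be sketched with SQLite standing in for both stores. The table names, divisions, and figures below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dwh_sales (division TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO dwh_sales VALUES (?, ?, ?)",
    [
        ("retail", "east", 100.0),
        ("retail", "east", 50.0),
        ("retail", "west", 25.0),
        ("wholesale", "east", 900.0),
    ],
)

# The data mart keeps only one division's data, aggregated by region.
conn.execute(
    "CREATE TABLE retail_mart AS "
    "SELECT region, SUM(amount) AS total_sales, COUNT(*) AS order_count "
    "FROM dwh_sales WHERE division = 'retail' GROUP BY region"
)
mart_rows = conn.execute(
    "SELECT region, total_sales, order_count FROM retail_mart ORDER BY region"
).fetchall()
print(mart_rows)
```

A BI report for the retail division would then query `retail_mart` instead of scanning the full warehouse table.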
Business Intelligence (BI) & Data Marts
[Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) → ETL Process → Target DW → ETL Process → Data Mart]
38. Business Intelligence (BI) & Analytics
[Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) → ETL Process → Target DWH → ETL Process → Data Mart]
40. Data Quality Issues
Data Quality Best Practices boost revenue by 66%.
46% of companies cite Data Quality as a barrier for
adopting Business Intelligence products.
80% of organizations… will underestimate the costs related
to the data acquisition tasks by an average of 50 percent.
The average organization loses $14.2 million annually
through poor Data Quality.
41. Data Quality
Primary Characteristics of Data Quality tools
(courtesy of Gartner’s “Magic Quadrant for Data Quality Tools”):
o Profiling
o Parsing and standardization
o Generalized Cleansing
o Matching
o Monitoring
o Enrichment
o Subject-area-specific support
o Metadata management
o Configuration environment
42. “The market for data quality software tools reached $1.61 billion in 2017 (the
most recent year for which Gartner has data), an increase of 11.6% over 2016.
Gartner’s interactions with clients also indicate that demand remains high.”
- Analyst firm Gartner
Data Quality - the marketplace
- Analyst firm Gartner’s Magic Quadrant
Leaders in Data Quality
44. Data Quality vs. Data Testing
Primary Characteristics of Data Quality tools
(courtesy of Gartner’s “Magic Quadrant for Data Quality Tools”):
o Profiling
o Parsing and standardization
o Generalized Cleansing
o Matching
o Monitoring
o Enrichment
o Subject-area-specific support
o Metadata management
o Configuration environment
Data Verification & Validation?
Primary Characteristics of Data Testing tools
(courtesy of the book "Testing the Data Warehouse Practicum"):
▪ Data Completeness
▪ Data Transformation
▪ Regression Testing
▪ Reporting
Data Verification & Validation?
45. Where Data Testing fits in your data strategy
46. Business Intelligence & Analytics
CxOs are using Business Intelligence & Analytics to make critical business decisions
– with the assumption that the underlying data is fine.
“The average organization loses $14.2 million annually through
poor Data Quality.” - Gartner
The Executive Office and Critical Data
[Diagram: Data Architecture showing typical data issue areas; labels include ETL and Mainframe]
47. Data Analyst: Creates data requirements (source-to-target map or mapping doc)
Data Architect: Models and builds data store (Big Data lake, Data Warehouse, etc.)
ETL Developer: Transforms and loads data from sources to target data stores
Data Tester: Validates the data, based on mappings, as it moves and transforms from sources to targets
Key Roles in Building & Testing a Data Store
48. a.k.a. Source-to-Target Map
It’s the critical element required to efficiently plan the target Data
Stores. It also defines the Extract, Transform, Load (ETL) process.
Intention:
✓ capture business rules
✓ data flow mapping
✓ data movement requirements
Mapping Doc specifies:
▪ Source input definition
▪ Target/output details
▪ Business & data transformation rules
▪ Absolute data quality requirements
▪ Optional data quality requirements.
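A mapping document with the contents listed above can also be captured as a small data structure that testers and developers share. The sketch below is one hypothetical way to model it; the class names, field names, and sample rules are assumptions made for illustration, not a format the deck prescribes.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ColumnMapping:
    source: str  # source input definition (column name)
    target: str  # target/output column
    rule: Callable[[str], str]  # business / data transformation rule
    required: bool = True  # absolute (True) vs. optional (False) requirement


@dataclass
class MappingDocument:
    name: str
    columns: list[ColumnMapping] = field(default_factory=list)

    def apply(self, source_row: dict) -> dict:
        """Produce the expected target row for one source row."""
        target_row = {}
        for col in self.columns:
            value = source_row.get(col.source)
            if value is None and col.required:
                raise ValueError(f"missing required source field: {col.source}")
            target_row[col.target] = None if value is None else col.rule(value)
        return target_row


mapping = MappingDocument(
    name="customer_load",
    columns=[
        ColumnMapping("cust_name", "customer_name", str.strip),
        ColumnMapping("st", "state_code", str.upper),
        ColumnMapping("nickname", "alias", str.title, required=False),
    ],
)
expected = mapping.apply({"cust_name": "  Acme ", "st": "ny"})
print(expected)
```

Keeping the rules executable means the same document can generate the expected target values that a data test compares against.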
Data Requirements = Mapping Document
49. Sampling
• Review Business Rules (i.e. mapping document, data flow mappings)
• Write Tests in SQL editor
• Execute 2 Tests: 1 at Source & 1 at Target
• Export results to 2 Excel files
• Compare a Sampling of results by eye (‘Stare & Compare’)
Issue with Stare & Compare:
Impossible to visually compare billions of data sets
Result: usually less than 1% of data is compared
Example - Current QuerySurge customer
• one test = 100 million rows X 200 columns = 20 billion data sets
• there is no practical way to manually verify (eyeball) this data set
• the client has more than 15,000 total tests
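The scale in the customer example is easy to verify, and the same arithmetic shows how little a manual sample covers. In the sketch below the 10,000-row sample is a hypothetical figure chosen for illustration; only the 100 million rows and 200 columns come from the slide.

```python
rows = 100_000_000
columns = 200
cells_per_test = rows * columns  # 20 billion data points in one test


def sampled_fraction(cells_checked: int, total_cells: int) -> float:
    """Fraction of the data actually compared when only a sample is reviewed."""
    return cells_checked / total_cells


# Hypothetical: a tester eyeballs 10,000 rows across all 200 columns.
checked = 10_000 * columns
coverage = sampled_fraction(checked, cells_per_test)
print(f"{cells_per_test:,} cells; sample covers {coverage:.4%}")
```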
Most Common Data Validation Method
50. Data Store Roles, Tasks, & Timelines
Data Analyst: Determine Requirements; Create & maintain Mapping Document
Data Architect: Review Mapping Document; Model and build target Data Stores; Maintain Target Data Stores
ETL Developer: Review Mapping Document; Build data movement logic; Extract & load data or extract, transform, & load data
Data Tester: Review Mapping Document; Create 2 SQL tests for each mapping with SQL editor; Execute tests; Dump results of tests to 2 Excel files; Compare Excel files by eye
[Timeline diagram: two task sequences are marked "iterate", and the diagram carries the label "Huge Risk"]
52. What is QuerySurge?
QuerySurge is the leading testing solution for
automated validation & testing of Big Data
Use Cases
53. How QuerySurge Works
QuerySurge connects to any 2 points at one time.
Target Data
• Big Data stores: Hadoop, NoSQL
• Data Warehouses
• XML
• Web Services
Source Data
• Data Stores: Databases, Data Warehouses, Data Marts
• Flat Files: Fixed Width, Delimited
• Excel
• JSON
• Business Intelligence Reports
Queries (SQL, HQL) run against the Source Data and the Target Data
Comparison of every data set
Results – pass/fail: Data Intelligence Reports, Data Health Dashboard, automated email reports
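The query-both-sides-then-compare flow can be illustrated generically. The sketch below is not QuerySurge's API or implementation; it is a minimal stand-in that uses two in-memory SQLite databases as the source and target, with hypothetical table names and a hypothetical upper-casing rule.

```python
import sqlite3


def fetch(conn, query):
    """Run a query and return its rows as a list of tuples."""
    return conn.execute(query).fetchall()


def compare(source_rows, target_rows):
    """Compare every row; report rows missing from, or unexpected in, the target."""
    source_set, target_set = set(source_rows), set(target_rows)
    missing = sorted(source_set - target_set)
    unexpected = sorted(target_set - source_set)
    return {
        "passed": not missing and not unexpected,
        "missing": missing,
        "unexpected": unexpected,
    }


# Two in-memory databases stand in for the source and target data stores.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INTEGER, state TEXT)")
target.execute("CREATE TABLE dim_customer (id INTEGER, state_code TEXT)")
source.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "ny"), (2, "nj"), (3, "ct")]
)
target.executemany(
    "INSERT INTO dim_customer VALUES (?, ?)", [(1, "NY"), (2, "NJ"), (3, "CA")]
)

# The source query applies the mapping rule so both sides are comparable.
result = compare(
    fetch(source, "SELECT id, UPPER(state) FROM customers"),
    fetch(target, "SELECT id, state_code FROM dim_customer"),
)
print(result)
```

Because this version compares sets, it ignores duplicate rows; a fuller check would also compare row counts on each side.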
54. Big Data Process - Developer & Tester
ETL Developer: Codes data movement based on Mapping Requirements
Data Tester: Tests data movement based on Mapping Requirements
BI Analyst extracts data for reports
Tester tests BI Reports
[Diagram: Source Data, Big Data lake, ETL, Data Warehouse, ETL, Data Mart, and BI & Analytics, annotated with Testing Points #1, #2, #3, and #4]
55. QuerySurge supports the following data stores…
• Amazon Redshift, Elastic Map Reduce, DynamoDB
• Apache Hadoop/Hive, Spark
• Cassandra
• Cloudera
• Couchbase
• Exasol
• Flat Files (delimited, fixed-width)
• Hortonworks
• IBM (Db2, Netezza, Informix, Big Insights, Cloudant, MDM, Cognos)
• JSON files
• Mainframe
• MapR
• Micro Focus Vertica
• Microsoft (SQL Server DWH, HDInsight, PDW, SSAS, Excel, Access,
SharePoint)
• MongoDB
• Oracle (Oracle DB, MySQL, Exadata, NoSQL, Hadoop)
• Pivotal GreenPlum
• PostgreSQL
• Salesforce
• SAP (HANA, IQ, ASE, SQL Anywhere, Altiscale Data Cloud)
• Snowflake
• Tableau
• Teradata, Aster
• Workday
• XML
…and any other data store
QuerySurge Supports 50+ Data Stores
57. 1) Data stewardship
Identifying and assigning roles and responsibilities.
- who is creating its data,
- who has overall responsibility for the data,
- who uses the data, who routes it,
- who oversees its use.
2) Data classification
Identify and categorize data types into groups.
3) Data quality
The process of measuring the reliability of current data sets to provide
information that can be used to make organizational decisions.
4) Data management
The process where all the organization's data governance efforts come together. The
company actively manages its data governance efforts, which involves creating the
architectures and business processes required to properly maintain the organization’s
data through its full lifecycle.
4 main components of successful data governance
58. source: IBM Data Governance Council Maturity Model
• Patterned after the Capability Maturity Model Integration (CMMI) from the
Software Engineering Institute (SEI) at Carnegie Mellon University
• Devised by IBM, along with 55 other companies
• Few stable processes exist
• “Just do it” mentality
• Data-related policies become more clear & reflect the
organization’s data principles.
• Data integration opportunities are better leveraged.
• Risk assessment for data integrity & quality becomes part of the
organization’s project methodology.
• Further defined value of data for more data elements
• Data Governance methodology is introduced during the
planning stages of new projects
• Enterprise data models are documented & published
• Data Governance is second nature
• ROI for data-related projects is tracked
• Business value of data mgmt is recognized
• Cost of data mgmt is easier to manage
• Costs are reduced as processes become
automated
• More data-related controls are documented
• Metadata becomes an important part of documenting critical
data elements.
Data Maturity Model - Process
59. “Rapidly increasing growth in data volumes, rising regulatory & compliance
mandates, and enhancing strategic risk management & decision-making are
expected to drive the growth of the data governance market.”
“The data governance market size is expected to grow from $1.31 Billion in 2018
to $3.53 Billion by 2023, at a CAGR of 22.0%.”
- MarketsAndMarkets.com
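The growth figures in this quote can be cross-checked with the standard compound-growth formula. The sketch below is a plain arithmetic check with hypothetical function names; it shows that the start value, end value, and 22.0% rate are roughly consistent with each other over the five years from 2018 to 2023.

```python
def project(start_value: float, annual_growth_rate: float, years: int) -> float:
    """Compound a starting value forward by a fixed annual growth rate."""
    return start_value * (1 + annual_growth_rate) ** years


def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by a start value, end value, and period."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted above: $1.31 billion (2018) to $3.53 billion (2023).
projected_2023 = project(1.31, 0.22, 2023 - 2018)
implied_rate = cagr(1.31, 3.53, 2023 - 2018)
print(f"projected: ${projected_2023:.2f}B, implied CAGR: {implied_rate:.1%}")
```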
Data Governance - the marketplace
- The Forrester Wave
Leaders in Data Governance
61. [Diagram: Source Data, Big Data lake, ETL, Data Warehouse, ETL, Data Mart, BI & Analytics]
Source types
• Flat files
• Excel
• JSON
• XML
• Web services
• Databases
ETL Vendors
• Ab Initio
• IBM
• Informatica
• Microsoft
• Oracle
• SAP
• SAS
• Talend
Hadoop Vendors
• Amazon
• Cloudera
• Hortonworks
• IBM
• MapR
• Microsoft
NoSQL Vendors
• Amazon
• Apache
• Cassandra
• Couchbase
• MongoDB
• Oracle
Data Warehouse Vendors
• Amazon
• IBM
• Microsoft
• Micro Focus
• Oracle
• SAP
• Snowflake
• Teradata
BI Vendors
• IBM
• Microsoft
• MicroStrategy
• Qlik
• Tableau
• SAP
• Oracle
Data Quality
• Informatica
• IBM
• Oracle
• SAP
• SAS
• Talend
Data Testing
• QuerySurge
• Informatica
• Tricentis
• Data Gaps
• IceDQ
• Bitwise
Data Governance
• Collibra
• DATUM
• GDE
• IBM
• Informatica
• SAP
the Data World by Top Vendors
62. The Data World Distilled:
Understanding how the data world works in the Big Data era
Any questions?
Bill Hayduk
Founder, CEO
QuerySurge™