Grab some coffee and enjoy the pre-show banter before the top of the hour!
Hadoop 2.0: Solving the Data Quality Challenge 
The Briefing Room
Twitter Tag: #briefr 
The Briefing Room 
Welcome 
Host: 
Eric Kavanagh 
eric.kavanagh@bloorgroup.com 
@eric_kavanagh
Mission
• Reveal the essential characteristics of enterprise software, good and bad
• Provide a forum for detailed analysis of today’s innovative technologies
• Give vendors a chance to explain their product to savvy analysts
• Allow audience members to pose serious questions... and get answers!
Topics
This Month: INNOVATIVE TECHNOLOGY
August: BIG DATA ECOSYSTEM
September: INTEGRATION
2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room
Analyst: Dr. Claudia Imhoff 
Claudia Imhoff is President & Founder of Intelligent Solutions, Inc.
RedPoint Global 
• RedPoint Global is a data management and integrated marketing technology company
• Its Convergent Marketing Platform™ offers products designed for data management, collaboration and architecture integration.
• RedPoint Data Management for Hadoop is YARN-compliant and enables analysts to access and manipulate data directly within the Hadoop cluster.
Guest: George Corugedo 
George Corugedo is Chief Technology Officer & Co-Founder at RedPoint Global Inc. A mathematician and seasoned technology executive, George has over 20 years of business and technical expertise. As co-founder and CTO of RedPoint Global, George is responsible for leading the development of the RedPoint Convergent Marketing Platform™. A former math professor, George left academia to co-found Accenture’s Customer Insight Practice, which specialized in strategic data utilization, analytics and customer strategy. Previous positions include director of client delivery at ClarityBlue, Inc., a provider of hosted customer intelligence solutions to enterprise commercial entities, and COO/CIO of Riscuity, a receivables management company specializing in the utilization of analytics to drive collections.
The Neglected Discipline of Data Quality in Hadoop 
July 2014
Overview – Challenges to Adoption 
Skills Gap
• Severe shortage of MR skilled resources
• Very expensive resources and hard to retain
• Inconsistent skills lead to inconsistent results
• Underutilizes existing resources
• Prevents broad leverage of investments across enterprise

Maturity & Governance
• A nascent technology ecosystem around Hadoop
• Emerging technologies only address narrow slivers of functionality
• New applications are not enterprise class
• Legacy applications have built short term capabilities

Data Into Information
• Data is not useful in its raw state, it must be turned into information
• Benefit of Hadoop is that same data can be used from many perspectives
• Analysts must now do the structuring of the data based on intended use of the data

© RedPoint Global Inc. 2014 Confidential
Key Points to Cover Today 
• Broad functionality across data processing domains
• Validated ease of use, speed, match quality and party data superiority
• Hadoop 2.0/YARN certified – 1 of first 17 companies to do so
• Not a repackaging of Hadoop 1.0 functionality. RedPoint Data Management is a pure YARN application (1 of only 2 in the initial wave of certifications)
• Building a complex job in RPDM takes a fraction of the time that it takes to write the same job in MapReduce and requires none of the coding or Java skills.
• Big functional footprint without touching a line of code
• Design model consistent with data flow paradigm
• RPDM has a “Zero-Footprint” install in the Hadoop cluster
• The same interface and functionality is available for both structured and unstructured databases. Thus it is seamless to work across both from a user’s perspective.
• Data quality done completely within the cluster
Key features of RedPoint Data Management 
ETL & ELT
• Profiling, reads/writes, transformations
• Single project for all jobs

Data Quality
• Cleanse data
• Parsing, correction
• Geo-spatial analysis

Integration & Matching
• Grouping
• Fuzzy match

Master Key Management
• Create keys
• Track changes
• Maintain matches over time

Web Services Integration
• Consume and publish
• HTTP/HTTPS protocols
• XML/JSON/SOAP formats

Process Automation & Operations
• Job scheduling, monitoring, notifications
• Central point of control

All functions can be used on both TRADITIONAL and BIG DATA
Creates clean, integrated, actionable data – quickly, reliably and at low cost
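
To make the fuzzy match and key creation features on this slide concrete, here is a minimal sketch in plain Java. It is a hypothetical illustration and not RedPoint's matching algorithm: it assumes records are simple name strings, uses Levenshtein edit distance as the fuzzy comparison, and gives each record the key of the first earlier group whose representative is similar enough. The class name `FuzzyMatchSketch`, the method names and the similarity threshold are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration only; not RedPoint's matching algorithm. */
final class FuzzyMatchSketch {

    /** Classic dynamic-programming Levenshtein edit distance. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            // Swap rows so that prev always holds the most recently computed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Cleansing step (assumed): trim, upper-case, collapse runs of whitespace. */
    static String normalize(String s) {
        return s.trim().toUpperCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    /** Similarity in [0, 1]; 1.0 means identical after normalization. */
    static double similarity(String a, String b) {
        String x = normalize(a);
        String y = normalize(b);
        int longest = Math.max(x.length(), y.length());
        return longest == 0 ? 1.0 : 1.0 - (double) editDistance(x, y) / longest;
    }

    /**
     * Assigns a group key to each record. A record joins the first existing
     * group whose representative is at least `threshold` similar; otherwise
     * it starts a new group. Returns one key per input record.
     */
    static int[] assignKeys(List<String> records, double threshold) {
        List<String> representatives = new ArrayList<>();
        int[] keys = new int[records.size()];
        for (int i = 0; i < records.size(); i++) {
            int key = -1;
            for (int g = 0; g < representatives.size(); g++) {
                if (similarity(records.get(i), representatives.get(g)) >= threshold) {
                    key = g;
                    break;
                }
            }
            if (key < 0) {
                representatives.add(records.get(i));
                key = representatives.size() - 1;
            }
            keys[i] = key;
        }
        return keys;
    }
}
```

With a threshold of 0.8, the records "John Smith", "Jon Smith" and "Mary Jones" receive the keys 0, 0 and 1. The greedy first-match grouping keeps the sketch short; a production matcher would also need blocking so that it does not compare every record against every group.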
RedPoint Data Management on Hadoop 
[Architecture diagram. Recoverable labels: Parallel Section (UI); Partitioning AM / Tasks; Execution AM / Tasks; Data I/O; Key / Split Analysis; YARN; MapReduce]
RedPoint Functional Footprint 
[Diagram of the Hadoop stack. Recoverable labels:
 Source data: Sensor Logs, Clickstream, Flat Files, Unstructured, Sentiment, Customer, Inventory; JMS Queues; DBs; Files; NFS
 Data Sources: RDBMS, EDW
 LOAD: SQOOP, WebHDFS, Flume
 LOAD: SQOOP/Hive, WebHDFS
 REST, HTTP, STREAM
 INTERACTIVE: HIVE Server2
 DATA REFINEMENT: PIG, HIVE, MAPREDUCE
 STRUCTURE: HCATALOG (metadata services)
 Monitoring and Management Tools: AMBARI
 YARN; HDFS nodes 1 to n
 Query/Visualization/Reporting/Analytical Tools and Apps]
RedPoint Benchmarks – Project Gutenberg

Sample MapReduce (small subset of the entire code, which totals nearly 150 lines):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // WordOffset is the job's custom key type; it is defined in the part of the source not shown here.
    public static class MapClass extends Mapper<WordOffset, Text, Text, IntWritable> {
        // Characters treated as word separators (the embedded double quote must be escaped).
        private final static String delimiters =
            "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ |«»¡¢£¤¥¦©¬®¯±¶·¿";
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emits (word, 1) for every token found in the input line.
        public void map(WordOffset key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line, delimiters);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

Sample Pig script without the UDF:

    SET pig.maxCombinedSplitSize 67108864
    SET pig.splitCombination true
    A = LOAD '/testdata/pg/*/*/*';
    B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
    C = FOREACH B GENERATE UPPER(word) AS word;
    D = GROUP C BY word;
    E = FOREACH D GENERATE COUNT(C) AS occurrences, group;
    F = ORDER E BY occurrences DESC;
    STORE F INTO '/user/cleonardi/pg/pig-count';

Comparison:
MapReduce: >150 lines of MR code; 6 hours of development; 6 minutes runtime; extensive optimization needed
Pig: ~50 lines of script code; 3 hours of development; 15 minutes runtime; User Defined Functions required prior to running script
RedPoint: 0 lines of code; 15 min. of development; 3 minutes runtime; no tuning or optimization required
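
For readers who want to see what the benchmark job computes without a cluster, the sketch below reproduces the same logic in plain Java: tokenize each line, upper-case each word, count occurrences, and order by count descending. It is only an illustration under stated assumptions. The class name `WordCountSketch` and its method names are hypothetical, the delimiter set is a shortened stand-in for the one on the slide, and the tie-break on the word itself is added here for deterministic output. It says nothing about how any of the three benchmarked tools perform.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;

/** Hypothetical single-process equivalent of the word-count benchmark logic. */
final class WordCountSketch {

    // Shortened, illustrative delimiter set (an assumption, not the slide's full list).
    private static final String DELIMITERS = " \t\n\r',./<>?;:\"[]{}-=_+()&*%^#$!@`~|";

    /** Tokenizes every line, upper-cases each token and counts occurrences. */
    static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line, DELIMITERS);
            while (itr.hasMoreTokens()) {
                String word = itr.nextToken().toUpperCase(Locale.ROOT);
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Orders the counts by occurrences descending, then by word for stable output. */
    static List<Map.Entry<String, Integer>> byOccurrencesDesc(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> rows = new ArrayList<>(counts.entrySet());
        rows.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return rows;
    }
}
```

Calling `WordCountSketch.count` on the two lines "The cat, the hat." and "THE end" yields four distinct words, with THE counted three times.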
Attributes of Information 
RELEVANT
Information must pertain to a specific problem. General data must be connected to reveal relevance of the information.

COMPLETE
Partial information is often worse than no information. Partial information frequently leads to worse conclusions than if no data had been used at all.

ACCURATE
This one is obvious. In a context like health care, inaccurate data can be fatal. Precision is required across all applications of information.

CURRENT
As data ages, it becomes less accurate. Multiple research studies by Google and others show the decay in the accuracy of analytics as data becomes stale.

ECONOMICAL
There has to be a clear cost benefit. This requires work to identify the realizable benefit of information, but this is also what drives the use if successful.
Reference Architecture for Matching in Hadoop 
[Diagram. Data sources: CRM, ERP, Billing, Subscriber, Product, Network, Weather, Compete, Manuf., Clickstream, Online Chat, Sensor Data, Social Media, Call Detail Records, Fabrication Logs, Sales Feedback, Field Feedback]
RedPoint DM for Hadoop: Processing Flow
[Diagram with numbered steps 1, 2, 3. Recoverable labels: Data Management Designer; DM Execution Server; Parallel Section; Resource Manager; Launches DM App Master; Launches Tasks; Running DM Task; three Node Managers hosting the DM App Master and DM Tasks]
The Data Management designer 
DM Hadoop Settings 
DM Parallel Section on Hadoop 
Who Should Care 
• Companies interested in exploring the promise of Big Data Analytics that need an easy way to get started.
• Companies already investing heavily in Big Data Analytics technologies that are stuck due to the shortage of skilled resources
• Large organizations that are focused on “Operational Offloading” and need to achieve it cost effectively
• Companies who recognize that much of the data that lands in Hadoop is external to the organization and need to have Data Quality and proper data governance applied to their Hadoop data.
RedPoint Convergent Marketing Ecosystem 
[Diagram. Recoverable labels:
 Data Inputs: No SQL, Social, SQL, Enhancement
 Mobile, Social, Digital
 RedPoint Interaction: Segmentation, Inbox, Analysis, Attribution, GIS
 Marketing Rules Engine: CRM, Trigger, Audience, Offer
 RedPoint Data Management: Machine Learning, Analytics, Email, Address Std., Web Services, Geocoding
 Real Time Cache
 Marketing Operations, Analytics, Hadoop]
RedPoint real-time decisions: how it works 
(web site example) 
[Diagram of the flow between a web site and the REDPOINT EXECUTION ENVIRONMENT. Recoverable labels:
 web site (www): personalization opportunity; API call: CONTENT NEEDED; API: personalized content
 REDPOINT EXECUTION ENVIRONMENT: profile data; context data; real-time profile; Machine Learning; rules; winning content
 candidate content with associated eligibility & scoring rules
 content stored in RedPoint, or RedPoint points to content in CMS or other system
 1st Party Customer data in database(s) and/or Hadoop
 RedPoint update/maintain over time
 inbound personalizations combined with outbound contacts to create cross-channel interaction history]
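
The slide describes candidate content that carries eligibility and scoring rules, from which a winning piece of content is returned to the web site. A minimal sketch of that selection step, in plain Java, might look like the following. Everything in it is an assumption made for illustration: the class `ContentSelectionSketch`, the `Candidate` shape, the representation of a profile as a string map, and the "highest score among eligible candidates wins" policy. It does not describe RedPoint's actual API or decision logic.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.function.ToDoubleFunction;

/** Hypothetical illustration of rule-based content selection. */
final class ContentSelectionSketch {

    /** A piece of candidate content with its eligibility and scoring rules. */
    static final class Candidate {
        final String contentId;
        final Predicate<Map<String, String>> eligibility;
        final ToDoubleFunction<Map<String, String>> score;

        Candidate(String contentId,
                  Predicate<Map<String, String>> eligibility,
                  ToDoubleFunction<Map<String, String>> score) {
            this.contentId = contentId;
            this.eligibility = eligibility;
            this.score = score;
        }
    }

    /**
     * Returns the id of the eligible candidate with the highest score for the
     * given profile, or an empty Optional when no candidate is eligible.
     * Ties are resolved in favor of the earlier candidate in the list.
     */
    static Optional<String> winningContent(List<Candidate> candidates, Map<String, String> profile) {
        Candidate best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Candidate candidate : candidates) {
            if (!candidate.eligibility.test(profile)) {
                continue; // eligibility rule filters the candidate out
            }
            double s = candidate.score.applyAsDouble(profile);
            if (s > bestScore) {
                best = candidate;
                bestScore = s;
            }
        }
        return best == null ? Optional.empty() : Optional.of(best.contentId);
    }
}
```

In this sketch the profile map stands in for the combined profile and context data in the diagram; a real deployment would presumably also record the decision so that it can feed the cross-channel interaction history.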
RedPoint vs. alternatives 
ü û 
ü û 
ü û 
ü û 
ü û 
ü û 
ü û 
Pure 
YARN, 
no 
MapReduce 
Graphical 
UI, 
not 
code-­‐based 
Top 
rated 
for 
ease-­‐of-­‐use 
All 
DQ/DI 
funcAons 
available 
Executes 
in 
Hadoop, 
no 
data 
movement 
Zero 
footprint 
install, 
nothing 
in 
the 
cluster 
Same 
product 
for 
Hadoop 
and 
database 
26 © RedPoint Global Inc. 2014 Confidential
Perceptions & Questions
Analyst: Dr. Claudia Imhoff
Data Quality in the Hadoop Age 
Solve your business puzzles with Intelligent Solutions 
By Claudia Imhoff, PhD 
Intelligent Solutions, Inc. 
Boulder BI Brain Trust 
Claudia@BBBT.US 
Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved
Claudia Imhoff 
President and Founder, Intelligent Solutions, Inc.
A thought leader, visionary, and practitioner, Claudia Imhoff, Ph.D., is an internationally recognized expert on analytics, business intelligence, and the architectures to support these initiatives. Dr. Imhoff has co-authored five books on these subjects and writes articles (totaling more than 150) for technical and business magazines.
She is also the Founder of the Boulder BI Brain Trust (BBBT), an international consortium of independent analysts and experts. You can follow them on Twitter at #BBBT or become a subscriber at www.bbbt.us.
Email: claudia@bbbt.us
Phone: 303-444-6650
Twitter: Claudia_Imhoff
Agenda 
§ Extending the Data Warehouse Architecture 
§ Things to Ponder… 
Next Generation BI 
[Diagram: "Next generation BI" at the center, with DRIVERS and TECHNOLOGIES around it. Recoverable labels: New business insights; Reduced costs; New technologies; Enhanced data management; Advanced analytics; New deployment options]
Based on a concept by Shree Dandekar of Dell
Slide compliments of Colin White – BI Research, Inc.
Systems of Record 
§ Remember – It all starts here!
§ Transactional systems generate most of the data used for all other activities – operational processes, BI & analytical capabilities, etc.
§ The point here is a reminder:
  § Extend OLTP systems of record as a “key” source of data
  § Many companies do not (or can not) leverage data they already have in their operational systems
[Diagram labels: Operational systems; RT BI services; Other internal & external structured & multi-structured data; Real-time streaming data]
Next Generation – Extended Data Warehouse Architecture (XDW)
[Diagram labels: Analytic tools & applications; Traditional EDW environment; Investigative computing platform; Data refinery; Data integration platform; RT analysis platform; Operational real-time environment; Operational systems; RT BI services; Other internal & external structured & multi-structured data; Real-time streaming data]
Slide created by Colin White – BI Research, Inc.
Use Case: Traditional EDW 
Most BI environments today:
§ New technologies can be incorporated into the EDW environment to improve performance, efficiency & reduce costs
Use cases:
§ Production reporting (data quality sensitive)
§ Historical comparisons
§ Customer analysis (next best offer, segmentation, life-time value scores, churn analysis, etc.)
§ KPI calculations
§ Profitability analysis
§ Forecasting
[Diagram labels: Analytic tools & applications; Traditional EDW environment; Data integration platform; Operational systems; RT BI services; real-time models & rules]
Data Quality Needed 
§ EDW is now the “production” analytical environment
  § Produces standard reports, comparisons, and analytics to be used as final word on situations
§ Data must be integrated as much as possible
§ Data must be run through data quality grist mill
§ There must be a full audit trail from source to ultimate report, analytic, etc.
Use Case: Data Refinery 
Ingests raw detailed data in batch and/or real-time into managed data store (lake, hub, swamp, dump…)
Distills the data into useful business information and distributes the results to downstream systems
May also directly analyze some data
Employs low-cost hardware and software to enable large amounts of detailed data to be managed cost effectively
Requires (flexible) governance policies to manage data security, privacy, quality, archiving and destruction
[Diagram labels: Traditional EDW environment; Investigative computing platform; Data refinery; Data integration platform]
Data Quality Needed 
§ This is not a data dumping ground!
  § It should be monitored and assessed as to the data integration and quality needs
§ Just because you can store massive sets of data doesn’t mean it is ignored or assumed to not need governance
§ Nor does it mean that there is no need for a business case for the massive amount of data
  § If analytic accuracy is at 99% using 45% of the data, why deal with all of it?
§ But speed of integration and quality processing is also important
Use Case: Investigative Computing
New technologies used here include:
§ Hadoop, in-memory computing, columnar storage, data compression, appliances, etc.
Use cases:
§ Data mining and predictive modeling for EDW and real-time environments
§ Cause and effect analysis
§ Data exploration (“Did this ever happen?” “How often?”)
§ Pattern analysis
§ General, unplanned investigations of data
[Diagram labels: Analytic tools & applications; Investigative computing platform; Data refinery; Data integration platform; RT analysis platform; Operational real-time environment; Operational systems; RT BI services]
Data Quality Needed 
§ Much more experimental in nature – lots of queries with null results
§ Analytics may be approximations
§ Data integration may be needed for some data, not for other data
§ Data quality also varies in terms of what data must go through DQ process
§ Difficulty is in determining what gets integrated and run through data quality processing
Use Case: Real Time Operational Environment
Embedded or callable BI services:
§ Real-time fraud detection
§ Real-time loan risk assessment
§ Optimizing online promotions
§ Location-based offers
§ Contact center optimization
§ Supply chain optimization
Real-time analysis engine:
§ Traffic flow optimization
§ Web event analysis
§ Natural resource exploration analysis
§ Stock trading analysis
§ Risk analysis
§ Correlation of unrelated data streams (e.g., weather effects on product sales)
[Diagram labels: RT analysis platform; Operational real-time environment; Operational systems; RT BI services; Other internal & external structured & multi-structured data; Real-time streaming data]
Data Quality Needed 
§ Because of operational nature, data must be as good as it can possibly be
§ Data may or may not be integrated with other operational systems’ data
§ False positives and negatives to models must be reconciled as quickly as possible
§ But speed of integration and quality processing is of the utmost importance!
All Components Must Work Together 
[Diagram. Recoverable labels: Investigative computing platform; Analytic tools & apps; analytic models; analyses; Data refinery; Traditional EDW environment; Operational systems; existing customer data; next best customer offer; 3rd party data; location data; social data; feedback; RT analysis platform; call center dashboard or web event stream; Other internal & external structured & multi-structured data; Real-time streaming data]
Slide created by Colin White – BI Research, Inc.
Agenda 
§ Extending the Data Warehouse Architecture 
§ Things to Ponder… 
What Makes People Think These Have Gone Away?
§ Data Redundancy
  § Each system, application, and department in enterprise collects own version of key business entities and attributes
§ Data Inconsistency
  § Enormous resources (time, money, and people) spent in reconciliation because of fractured data
§ Business Inefficiency
  § Fractured data generates business inefficiency – low productivity, inefficient supply chain management, customer dissatisfaction, wasted marketing efforts
§ Business Change
  § Organizations are constantly changing and these disruptive events cause a constant stream of changes to data
Data Quality Challenges 
§ Cultural Hurdles
  § Generating business case and obtaining executive backing and funding
  § Requires a phased approach to quality deployment
  § Overcoming political barriers
  § E.g., moving from enterprise view to LOB/parochial view of quality, yet still agreeing on common business definitions
Data Quality Challenges 
§ Technology Challenges
  § Unusual sources of data
  § Creating a flexible data governance model
  § Supporting complex & constantly changing data
  § Providing a flexible data integration infrastructure
  § Wild West mentality…
Data Governance and Data Quality Are Changing
§ People using BI must “trust” the data
  § IT must work with the business to create certified data sets
  § Note: not all data must be certified but all data usage must be documented and monitored
§ Governance still has an important role
  § Determine whether data used is “governed” (e.g., in a data warehouse or MDM environment) or “ungoverned” (e.g., individual spreadsheets, external source)
  § Difficulty is figuring out differences – hence the need to monitor data usage
§ IT must have monitoring or oversight capability
Note: LOB IT or experienced information producers may have to take on some previously traditional central IT roles
Questions 
§ What are the biggest challenges for data quality in the Hadoop age?
§ How do you justify the need for integration and quality processing in the “age of hurry up and give me the data”?
§ Not all data needs to be cleaned up and integrated but how do people determine what does and doesn’t?
§ What tips can you give us to help get the time, resources and funding for DQ in the refinery?
§ Technologically speaking, what is different about the Hadoop environment versus a traditional RDBMS one?
§ Who sponsors/is responsible for the data quality/integration effort in the age of Hadoop?
Upcoming Topics
This Month: INNOVATIVE TECHNOLOGY
August: BIG DATA ECOSYSTEM
September: INTEGRATION
www.insideanalysis.com/webcasts/the-briefing-room
2014 Editorial Calendar at www.insideanalysis.com

THANK YOU for your ATTENTION!

Weitere ähnliche Inhalte

Was ist angesagt?

1524 how ibm's big data solution can help you gain insight into your data cen...
1524 how ibm's big data solution can help you gain insight into your data cen...1524 how ibm's big data solution can help you gain insight into your data cen...
1524 how ibm's big data solution can help you gain insight into your data cen...
IBM
 
The Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubThe Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data Hub
Cloudera, Inc.
 

Was ist angesagt? (20)

Hadoop Trends
Hadoop TrendsHadoop Trends
Hadoop Trends
 
Creating a Next-Generation Big Data Architecture
Creating a Next-Generation Big Data ArchitectureCreating a Next-Generation Big Data Architecture
Creating a Next-Generation Big Data Architecture
 
The Modern Data Architecture for Predictive Analytics with Hortonworks and Re...
The Modern Data Architecture for Predictive Analytics with Hortonworks and Re...The Modern Data Architecture for Predictive Analytics with Hortonworks and Re...
The Modern Data Architecture for Predictive Analytics with Hortonworks and Re...
 
Accelerate Self-Service Analytics with Data Virtualization and Visualization
Accelerate Self-Service Analytics with Data Virtualization and VisualizationAccelerate Self-Service Analytics with Data Virtualization and Visualization
Accelerate Self-Service Analytics with Data Virtualization and Visualization
 
Big Data: Setting Up the Big Data Lake
Big Data: Setting Up the Big Data LakeBig Data: Setting Up the Big Data Lake
Big Data: Setting Up the Big Data Lake
 
Data lake benefits
Data lake benefitsData lake benefits
Data lake benefits
 
Taming Big Data With Modern Software Architecture
Taming Big Data  With Modern Software ArchitectureTaming Big Data  With Modern Software Architecture
Taming Big Data With Modern Software Architecture
 
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...
 
Webinar: Emerging Trends in Data Architecture – What’s the Next Big Thing?
Webinar: Emerging Trends in Data Architecture – What’s the Next Big Thing?Webinar: Emerging Trends in Data Architecture – What’s the Next Big Thing?
Webinar: Emerging Trends in Data Architecture – What’s the Next Big Thing?
 
1524 how ibm's big data solution can help you gain insight into your data cen...
1524 how ibm's big data solution can help you gain insight into your data cen...1524 how ibm's big data solution can help you gain insight into your data cen...
1524 how ibm's big data solution can help you gain insight into your data cen...
 
Big Data & Analytics Architecture
Big Data & Analytics ArchitectureBig Data & Analytics Architecture
Big Data & Analytics Architecture
 
Exploring the Wider World of Big Data- Vasalis Kapsalis
Exploring the Wider World of Big Data- Vasalis KapsalisExploring the Wider World of Big Data- Vasalis Kapsalis
Exploring the Wider World of Big Data- Vasalis Kapsalis
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic Architecture
 
Organising the Data Lake - Information Management in a Big Data World
Organising the Data Lake - Information Management in a Big Data WorldOrganising the Data Lake - Information Management in a Big Data World
Organising the Data Lake - Information Management in a Big Data World
 
IDC Retail Insights - What's Possible with a Modern Data Architecture?
IDC Retail Insights - What's Possible with a Modern Data Architecture?IDC Retail Insights - What's Possible with a Modern Data Architecture?
IDC Retail Insights - What's Possible with a Modern Data Architecture?
 
Capgemini Insights and Data
Capgemini Insights and Data Capgemini Insights and Data
Capgemini Insights and Data
 
Hadoop Big Data Lakes Keynote
Hadoop Big Data Lakes KeynoteHadoop Big Data Lakes Keynote
Hadoop Big Data Lakes Keynote
 
The Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubThe Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data Hub
 
Microsoft SQL Azure - Scaling Out with SQL Azure Whitepaper
Microsoft SQL Azure - Scaling Out with SQL Azure WhitepaperMicrosoft SQL Azure - Scaling Out with SQL Azure Whitepaper
Microsoft SQL Azure - Scaling Out with SQL Azure Whitepaper
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
 

Ähnlich wie Hadoop 2.0 - Solving the Data Quality Challenge

YARN: the Key to overcoming the challenges of broad-based Hadoop Adoption
YARN: the Key to overcoming the challenges of broad-based Hadoop AdoptionYARN: the Key to overcoming the challenges of broad-based Hadoop Adoption
YARN: the Key to overcoming the challenges of broad-based Hadoop Adoption
DataWorks Summit
 
Drive dataqualityatyourcompanycreateadatalake
Drive dataqualityatyourcompanycreateadatalakeDrive dataqualityatyourcompanycreateadatalake
Drive dataqualityatyourcompanycreateadatalake
The Pathway Group
 
Chandan's_Resume
Chandan's_ResumeChandan's_Resume
Chandan's_Resume
Chandan Das
 

Ähnlich wie Hadoop 2.0 - Solving the Data Quality Challenge (20)

Has Traditional MDM Finally Met its Match?
Has Traditional MDM Finally Met its Match?Has Traditional MDM Finally Met its Match?
Has Traditional MDM Finally Met its Match?
 
Game Changed – How Hadoop is Reinventing Enterprise Thinking
Game Changed – How Hadoop is Reinventing Enterprise ThinkingGame Changed – How Hadoop is Reinventing Enterprise Thinking
Game Changed – How Hadoop is Reinventing Enterprise Thinking
 
Data Quality in the Data Hub with RedPointGlobal
Data Quality in the Data Hub with RedPointGlobalData Quality in the Data Hub with RedPointGlobal



Hadoop 2.0 - Solving the Data Quality Challenge

  • 1. Grab some coffee and enjoy the pre-show banter before the top of the hour!
  • 2. Hadoop 2.0: Solving the Data Quality Challenge The Briefing Room
  • 3. Twitter Tag: #briefr The Briefing Room Welcome Host: Eric Kavanagh eric.kavanagh@bloorgroup.com @eric_kavanagh
  • 4. Mission: Reveal the essential characteristics of enterprise software, good and bad; provide a forum for detailed analysis of today’s innovative technologies; give vendors a chance to explain their product to savvy analysts; allow audience members to pose serious questions... and get answers!
  • 5. Topics: This Month: INNOVATIVE TECHNOLOGY; August: BIG DATA ECOSYSTEM; September: INTEGRATION. 2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room
  • 6.
  • 7. Analyst: Dr. Claudia Imhoff. Claudia Imhoff is President & Founder of Intelligent Solutions, Inc.
  • 8. RedPoint Global: RedPoint Global is a data management and integrated marketing technology company. Its Convergent Marketing Platform™ offers products designed for data management, collaboration and architecture integration. RedPoint Data Management for Hadoop is YARN-compliant and enables analysts to access and manipulate data directly within the Hadoop cluster.
  • 9. Guest: George Corugedo. George Corugedo is Chief Technology Officer & Co-Founder at RedPoint Global Inc. A mathematician and seasoned technology executive, George has over 20 years of business and technical expertise. As co-founder and CTO of RedPoint Global, George is responsible for leading the development of the RedPoint Convergent Marketing Platform™. A former math professor, George left academia to co-found Accenture’s Customer Insight Practice, which specialized in strategic data utilization, analytics and customer strategy. Previous positions include director of client delivery at ClarityBlue, Inc., a provider of hosted customer intelligence solutions to enterprise commercial entities, and COO/CIO of Riscuity, a receivables management company specializing in the utilization of analytics to drive collections.
  • 10. The Neglected Discipline of Data Quality in Hadoop July 2014
  • 11. Overview – Challenges to Adoption. Skills Gap: severe shortage of MapReduce (MR) skilled resources; very expensive resources that are hard to retain; inconsistent skills lead to inconsistent results; underutilizes existing resources; prevents broad leverage of investments across the enterprise. Maturity & Governance: a nascent technology ecosystem around Hadoop; emerging technologies only address narrow slivers of functionality; new applications are not enterprise class; legacy applications have built short-term capabilities. Data Into Information: data is not useful in its raw state, it must be turned into information; a benefit of Hadoop is that the same data can be used from many perspectives; analysts must now do the structuring of the data based on the intended use of the data. © RedPoint Global Inc. 2014 Confidential
  • 12. Key Points to Cover Today: Broad functionality across data processing domains. Validated ease of use, speed, match quality and party data superiority. Hadoop 2.0/YARN certified, 1 of the first 17 companies to do so. Not a repackaging of Hadoop 1.0 functionality: RedPoint Data Management (RPDM) is a pure YARN application (1 of only 2 in the initial wave of certifications). Building a complex job in RPDM takes a fraction of the time that it takes to write the same job in MapReduce and requires none of the coding or Java skills. Big functional footprint without touching a line of code. Design model consistent with the data flow paradigm. RPDM has a “Zero-Footprint” install in the Hadoop cluster. The same interface and functionality are available for both structured and unstructured databases, so working across both is seamless from a user’s perspective. Data quality done completely within the cluster.
  • 13. Key features of RedPoint Data Management. ETL & ELT: profiling, reads/writes, transformations; single project for all jobs. Data Quality: cleanse data; parsing, correction; geo-spatial analysis. Integration & Matching: grouping; fuzzy match. Master Key Management: create keys; track changes; maintain matches over time. Web Services Integration: consume and publish; HTTP/HTTPS protocols; XML/JSON/SOAP formats. Process Automation & Operations: job scheduling, monitoring, notifications; central point of control. All functions can be used on both TRADITIONAL and BIG DATA. Creates clean, integrated, actionable data quickly, reliably and at low cost.
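The "fuzzy match" and "create keys" features on slide 13 can be illustrated with a minimal sketch. This is not RedPoint's matching algorithm, which the slides do not describe; it assumes a simple greedy grouping based on the standard library's `difflib.SequenceMatcher`, and the function names and the 0.85 threshold are hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and replace punctuation with spaces so trivial differences vanish."""
    kept = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(kept.split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def assign_master_keys(names, threshold=0.85):
    """Greedy grouping (an assumption, not a vendor method): each name joins the
    first group whose representative is similar enough, else starts a new group."""
    representatives = []  # one representative name per master key
    keys = []
    for name in names:
        for key, rep in enumerate(representatives):
            if similarity(name, rep) >= threshold:
                keys.append(key)
                break
        else:
            # No existing group matched: create a new master key.
            representatives.append(name)
            keys.append(len(representatives) - 1)
    return keys
```

Calling `assign_master_keys(["Acme Corp.", "ACME  corp", "Globex Inc"])` groups the two Acme spellings under one key and gives Globex its own. A production matcher would typically add blocking to avoid comparing every pair, which this sketch omits.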
  • 14. RedPoint Data Management on Hadoop [diagram labels: Parallel Section (UI); Partitioning AM / Tasks; Execution AM / Tasks; Data I/O; Key / Split Analysis; YARN; MapReduce]
  • 15. RedPoint Functional Footprint [diagram labels: Monitoring and Management Tools; AMBARI; DATA REFINEMENT; PIG; HIVE; MAPREDUCE; REST; HTTP; STREAM; STRUCTURE; HCATALOG (metadata services); DBs; Files; NFS; JMS Queues; Query/Visualization/Reporting/Analytical Tools and Apps; SOURCE DATA (sensor logs, clickstream, flat files, unstructured, sentiment, customer, inventory); Data Sources; RDBMS; EDW; INTERACTIVE; HIVE Server2; LOAD; SQOOP; SQOOP/Hive; WebHDFS; Flume; YARN; HDFS nodes 1 to n]
  • 16. RedPoint Benchmarks – Project Gutenberg
    Sample MapReduce (small subset of the entire code, which totals nearly 150 lines):
        public static class MapClass extends Mapper<WordOffset, Text, Text, IntWritable> {
            // Characters treated as token separators (the inner double quote is escaped).
            private final static String delimiters = "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ |«»¡¢£¤¥¦©¬®¯±¶·¿";
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            // Emit (token, 1) for every token in the input line.
            public void map(WordOffset key, Text value, Context context)
                    throws IOException, InterruptedException {
                String line = value.toString();
                StringTokenizer itr = new StringTokenizer(line, delimiters);
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }
    Sample Pig script without the UDF:
        SET pig.maxCombinedSplitSize 67108864
        SET pig.splitCombination true
        A = LOAD '/testdata/pg/*/*/*';
        B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
        C = FOREACH B GENERATE UPPER(word) AS word;
        D = GROUP C BY word;
        E = FOREACH D GENERATE COUNT(C) AS occurrences, group;
        F = ORDER E BY occurrences DESC;
        STORE F INTO '/user/cleonardi/pg/pig-count';
    Comparison (the third column is unlabeled in the transcript and is taken to be RedPoint):
                      MapReduce                       Pig                                            RedPoint
        Code          >150 lines of MR code           ~50 lines of script code                       0 lines of code
        Development   6 hours                         3 hours                                        15 min.
        Runtime       6 minutes                       15 minutes                                     3 minutes
        Notes         Extensive optimization needed   User Defined Functions required before running No tuning or optimization required
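For readers unfamiliar with either tool, the logic that the MapReduce and Pig samples on slide 16 express (tokenize, uppercase, count, order by descending count) can be sketched on a single machine. This is an illustration only, not the benchmark code: the delimiter set is an abbreviated ASCII subset of the Java string, and the alphabetical tie-break is an assumption added so the output is deterministic.

```python
from collections import Counter

# Abbreviated ASCII subset of the delimiter string used by the Java mapper.
DELIMITERS = "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ |\t\n"


def tokenize(line, delimiters=DELIMITERS):
    """Split a line on any delimiter character, dropping empty tokens."""
    table = str.maketrans({ch: " " for ch in delimiters})
    return line.translate(table).split()


def word_count(lines):
    """Return (occurrences, word) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        # Uppercasing mirrors the UPPER(word) step of the Pig script.
        counts.update(token.upper() for token in tokenize(line))
    # Sort by descending count; ties are broken alphabetically (an added assumption).
    return sorted(((n, w) for w, n in counts.items()), key=lambda p: (-p[0], p[1]))
```

For example, `word_count(["The quick fox.", "the lazy dog; the end"])` puts `(3, "THE")` first. The distributed versions differ mainly in how the counting step is partitioned across nodes, which this sketch does not model.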
  • 17. Attributes of Information. RELEVANT: Information must pertain to a specific problem. General data must be connected to reveal the relevance of the information. COMPLETE: Partial information is often worse than no information. Partial information frequently leads to worse conclusions than if no data had been used at all. ACCURATE: This one is obvious. In a context like health care, inaccurate data can be fatal. Precision is required across all applications of information. CURRENT: As data ages, it becomes less accurate. Multiple research studies by Google and others show the decay in the accuracy of analytics as data becomes stale. ECONOMICAL: There has to be a clear cost benefit. This requires work to identify the realizable benefit of information, but this is also what drives its use if successful.
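Two of the attributes on slide 17, COMPLETE and CURRENT, lend themselves to simple measurement. The sketch below is one possible way to profile them and is not taken from the presentation; the record layout, the field names and the age cutoff are all hypothetical.

```python
from datetime import date


def completeness(records, fields):
    """Fraction of non-empty values per field across all records."""
    total = len(records)
    result = {}
    for field in fields:
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        result[field] = filled / total if total else 0.0
    return result


def stale_fraction(records, date_field, today, max_age_days):
    """Fraction of records whose date is missing or older than max_age_days."""
    if not records:
        return 0.0
    stale = 0
    for r in records:
        updated = r.get(date_field)
        # A missing date is counted as stale because its age cannot be verified.
        if updated is None or (today - updated).days > max_age_days:
            stale += 1
    return stale / len(records)
```

A data steward might run both functions over each incoming feed and flag feeds whose completeness falls below, or whose stale fraction rises above, an agreed limit before the data is used for analytics.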
  • 18. Reference Architecture for Matching in Hadoop [diagram] Data Sources: CRM, ERP, Billing, Subscriber, Product, Network, Weather, Compete, Manuf., Clickstream, Online Chat, Sensor Data, Social Media, Call Detail Records, Fabrication Logs, Sales Feedback, Field Feedback
  • 19. RedPoint DM for Hadoop: Processing Flow [diagram labels: Data Management Designer; DM Execution Server; Parallel Section; Resource Manager; Node Managers hosting the DM App Master and DM Tasks; arrows labeled “Launches DM App Master” and “Launches Tasks”; Running DM Task; steps numbered 1, 2, 3]
  • 20. The Data Management designer
  • 21. DM Hadoop Settings
  • 22. DM Parallel Section on Hadoop
  • 23. Who Should Care: Companies interested in exploring the promise of Big Data Analytics that need an easy way to get started. Companies already investing heavily in Big Data Analytics technologies but stuck due to the shortage of skilled resources. Large organizations that are focused on “Operational Offloading” and need to achieve it cost effectively. Companies that recognize that much of the data that lands in Hadoop is external to the organization and need to have Data Quality and proper data governance applied to their Hadoop data.
  • 24. RedPoint Convergent Marketing Ecosystem [diagram labels: Data Inputs; NoSQL; Social; SQL; Enhancement; Mobile; Digital; RedPoint Interaction; Segmentation; Inbox; Analysis; Attribution; GIS; Marketing Rules Engine; CRM; Trigger; Audience; Offer; RedPoint Data Management; Machine Learning; Analytics; Email; Address Std.; Web Services; Geocoding; Real Time Cache; Marketing Operations; Analytics; Hadoop]
  • 25. RedPoint real-time decisions: how it works (web site example) [diagram labels: web site (www); personalization opportunity; API call: CONTENT NEEDED; REDPOINT EXECUTION ENVIRONMENT; profile data; context data; real-time profile; Machine Learning rules; winning content; API: personalized content; candidate content with associated eligibility & scoring rules; content stored in RedPoint, or RedPoint points to content in CMS or other system; 1st party customer data in database(s) and/or Hadoop; RedPoint update/maintain over time; inbound personalizations combined with outbound contacts to create cross-channel interaction history]
  • 26. RedPoint vs. alternatives (each item is marked ✓ for RedPoint and ✗ for the alternatives): Pure YARN, no MapReduce; Graphical UI, not code-based; Top rated for ease-of-use; All DQ/DI functions available; Executes in Hadoop, no data movement; Zero footprint install, nothing in the cluster; Same product for Hadoop and database
  • 27. Perceptions & Questions. Analyst: Dr. Claudia Imhoff
  • 28. Data Quality in the Hadoop Age Solve your business puzzles with Intelligent Solutions By Claudia Imhoff, PhD Intelligent Solutions, Inc. Boulder BI Brain Trust Claudia@BBBT.US SPONSORED BY HOSTED BY Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved
  • 29. Claudia Imhoff, President and Founder, Intelligent Solutions, Inc. A thought leader, visionary, and practitioner, Claudia Imhoff, Ph.D., is an internationally recognized expert on analytics, business intelligence, and the architectures to support these initiatives. Dr. Imhoff has co-authored five books on these subjects and writes articles (totaling more than 150) for technical and business magazines. She is also the Founder of the Boulder BI Brain Trust (BBBT), an international consortium of independent analysts and experts. You can follow them on Twitter at #BBBT or become a subscriber at www.bbbt.us. Email: claudia@bbbt.us Phone: 303-444-6650 Twitter: Claudia_Imhoff
  • 30. Agenda: Extending the Data Warehouse Architecture; Things to Ponder…
  • 31. Next Generation BI [diagram labels: Next generation BI; DRIVERS; TECHNOLOGIES; New business insights; Reduced costs; New technologies; Enhanced data management; Advanced analytics; New deployment options] Based on a concept by Shree Dandekar of Dell. Slide compliments of Colin White – BI Research, Inc.
  • 32. Systems of Record. Remember: it all starts here! Transactional systems generate most of the data used for all other activities: operational processes, BI & analytical capabilities, etc. The point here is a reminder: extend OLTP systems of record as a “key” source of data; many companies do not (or cannot) leverage data they already have in their operational systems. [diagram labels: Operational systems; RT BI services; Other internal & external structured & multi-structured data; Real-time streaming data]
  • 33. Next Generation – Extended Data Warehouse Architecture (XDW) [diagram labels: Analytic tools & applications; RT analysis platform; Traditional EDW environment; Investigative computing platform; Data refinery; Data integration platform; Operational real-time environment; Other internal & external structured & multi-structured data; Real-time streaming data; Operational systems; RT BI services] Slide created by Colin White – BI Research, Inc.
  • 34. Use Case: Traditional EDW. Most BI environments today: new technologies can be incorporated into the EDW environment to improve performance, efficiency & reduce costs. Use cases: production reporting (data quality sensitive); historical comparisons; customer analysis (next best offer, segmentation, life-time value scores, churn analysis, etc.); KPI calculations; profitability analysis; forecasting. [diagram labels: Analytic tools & applications; Traditional EDW environment; Data integration platform; Operational systems; RT BI services; real-time models & rules]
  • 35. Data Quality Needed: The EDW is now the “production” analytical environment; it produces standard reports, comparisons, and analytics to be used as the final word on situations. Data must be integrated as much as possible. Data must be run through the data quality grist mill. There must be a full audit trail from source to ultimate report, analytic, etc.
  • 36. Use Case: Data Refinery. Ingests raw detailed data in batch and/or real-time into a managed data store (lake, hub, swamp, dump…). Distills the data into useful business information and distributes the results to downstream systems. May also directly analyze some data. Employs low-cost hardware and software to enable large amounts of detailed data to be managed cost effectively. Requires (flexible) governance policies to manage data security, privacy, quality, archiving and destruction. [diagram labels: Traditional EDW environment; Investigative computing platform; Data refinery; Data integration platform]
  • 37. Data Quality Needed: This is not a data dumping ground! It should be monitored and assessed as to the data integration and quality needs. Just because you can store massive sets of data doesn’t mean the data can be ignored or assumed to not need governance. Nor does it mean that there is no need for a business case for the massive amount of data. If analytic accuracy is at 99% using 45% of the data, why deal with all of it? But speed of integration and quality processing is also important.
  • 38. Use Case: Investigative Computing. New technologies used here include Hadoop, in-memory computing, columnar storage, data compression, appliances, etc. Use cases: data mining and predictive modeling for EDW and real-time environments; cause and effect analysis; data exploration (“Did this ever happen?” “How often?”); pattern analysis; general, unplanned investigations of data. [diagram labels: Analytic tools & applications; Investigative computing platform; Data refinery; Data integration platform; RT analysis platform; Operational real-time environment; Operational systems; RT BI services]
  • 39. Data Quality Needed: Much more experimental in nature, with lots of queries returning null results. Analytics may be approximations. Data integration may be needed for some data but not for other data. Data quality also varies in terms of what data must go through the DQ process. The difficulty is in determining what gets integrated and run through data quality processing.
  • 40. Use Case: Real Time Operational Environment. Embedded or callable BI services: real-time fraud detection; real-time loan risk assessment; optimizing online promotions; location-based offers; contact center optimization; supply chain optimization. Real-time analysis engine: traffic flow optimization; web event analysis; natural resource exploration analysis; stock trading analysis; risk analysis; correlation of unrelated data streams (e.g., weather effects on product sales). [diagram labels: RT analysis platform; Operational real-time environment; Other internal & external structured & multi-structured data; Real-time streaming data; Operational systems; RT BI services]
  • 41. Data Quality Needed: Because of the operational nature, data must be as good as it can possibly be. Data may or may not be integrated with other operational systems’ data. False positives and negatives to models must be reconciled as quickly as possible. But speed of integration and quality processing is of the utmost importance!
  • 42. All Components Must Work Together [diagram labels: Investigative computing platform; Analytic tools & apps; analytic models; analyses; Data refinery; Traditional EDW environment; Operational systems; existing customer data; next best customer offer; 3rd party data; location data; social data; feedback; RT analysis platform; call center dashboard or web event stream; Other internal & external structured & multi-structured data; Real-time streaming data] Slide created by Colin White – BI Research, Inc.
  • 43. Agenda: Extending the Data Warehouse Architecture; Things to Ponder…
  • 44. What Makes People Think These Have Gone Away? § Data Redundancy § Each system, application, and department in enterprise collects own version of key business entities and attributes § Data Inconsistency § Enormous resources (time, money, and people) spent in reconciliation because of fractured data § Business Inefficiency § Fractured data generates business inefficiency – low productivity, inefficient supply chain management, customer dissatisfaction, wasted marketing efforts § Business Change § Organizations are constantly changing and these disruptive events cause a constant stream of changes to data Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved 44
  • 45. Data Quality Challenges
      § Cultural Hurdles
          § Generating business case and obtaining executive backing and funding
          § Requires a phased approach to quality deployment
          § Overcoming political barriers
              § E.g., moving from enterprise view to LOB/parochial view of quality, yet still agreeing on common business definitions
  • 46. Data Quality Challenges
      § Technology Challenges
          § Unusual sources of data
          § Creating a flexible data governance model
          § Supporting complex & constantly changing data
          § Providing a flexible data integration infrastructure
          § Wild West mentality…
  • 47. Data Governance and Data Quality Are Changing
      § People using BI must “trust” the data
      § IT must work with the business to create certified data sets
      § Note: not all data must be certified, but all data usage must be documented and monitored
      § Governance still has an important role
      § Determine whether data used is “governed” (e.g., in a data warehouse or MDM environment) or “ungoverned” (e.g., individual spreadsheets, external source)
      § Difficulty is figuring out differences – hence the need to monitor data usage
      § IT must have monitoring or oversight capability
      Note: LOB IT or experienced information producers may have to take on some previously traditional central IT roles
  • 48. Questions
      § What are the biggest challenges for data quality in the Hadoop age?
      § How do you justify the need for integration and quality processing in the “age of hurry up and give me the data”?
      § Not all data needs to be cleaned up and integrated, but how do people determine what does and doesn’t?
      § What tips can you give us to help get the time, resources and funding for DQ in the refinery?
      § Technologically speaking, what is different about the Hadoop environment versus a traditional RDBMS one?
      § Who sponsors/is responsible for the data quality/integration effort in the age of Hadoop?
  • 50. Upcoming Topics
      § This Month: INNOVATIVE TECHNOLOGY
      § August: BIG DATA ECOSYSTEM
      § September: INTEGRATION
      www.insideanalysis.com/webcasts/the-briefing-room
      2014 Editorial Calendar at www.insideanalysis.com
  • 51. THANK YOU for your ATTENTION!
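
Slide 47 distinguishes “governed” data (e.g., a data warehouse or MDM environment) from “ungoverned” data (e.g., individual spreadsheets, external sources) and argues that usage must be monitored. The Python sketch below is one possible way to illustrate that idea; it is not from the presentation. The source names, the `UsageEvent` record, and the helper functions are all hypothetical and chosen only for the example.

```python
from collections import Counter
from dataclasses import dataclass

# Assumption: a hand-maintained set of sources that count as "governed".
# These names are illustrative, not taken from the slides.
GOVERNED_SOURCES = {"data_warehouse", "mdm"}


@dataclass(frozen=True)
class UsageEvent:
    """One recorded use of a data source by a user (hypothetical record)."""
    user: str
    source: str


def classify(source: str) -> str:
    """Label a source as 'governed' or 'ungoverned'."""
    return "governed" if source in GOVERNED_SOURCES else "ungoverned"


def summarize_usage(events):
    """Count usage events per governance class."""
    return Counter(classify(e.source) for e in events)


def ungoverned_sources(events):
    """List ungoverned sources with their usage counts, most used first."""
    counts = Counter(
        e.source for e in events if classify(e.source) == "ungoverned"
    )
    return counts.most_common()


events = [
    UsageEvent("ann", "data_warehouse"),
    UsageEvent("bob", "spreadsheet"),
    UsageEvent("cat", "spreadsheet"),
    UsageEvent("dan", "external_feed"),
]
summary = summarize_usage(events)
hotspots = ungoverned_sources(events)
```

In this sketch, `summary` gives IT a coarse view of how much usage falls outside governed environments, and `hotspots` points to the ungoverned sources that might be candidates for certification first. A real deployment would draw events from query logs or a catalog rather than a hard-coded list.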