Hadoop 2.0 - Solving the Data Quality Challenge
3. Twitter Tag: #briefr
The Briefing Room
Welcome
Host:
Eric Kavanagh
eric.kavanagh@bloorgroup.com
@eric_kavanagh
4. Mission
• Reveal the essential characteristics of enterprise software, good and bad
• Provide a forum for detailed analysis of today’s innovative technologies
• Give vendors a chance to explain their product to savvy analysts
• Allow audience members to pose serious questions... and get answers!
5. Topics
This Month: INNOVATIVE TECHNOLOGY
August: BIG DATA ECOSYSTEM
September: INTEGRATION
2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room
7. Analyst: Dr. Claudia Imhoff
Claudia Imhoff is President & Founder of Intelligent Solutions, Inc.
8. RedPoint Global
• RedPoint Global is a data management and integrated marketing technology company
• Its Convergent Marketing Platform™ offers products designed for data management, collaboration and architecture integration.
• RedPoint Data Management for Hadoop is YARN-compliant and enables analysts to access and manipulate data directly within the Hadoop cluster.
9. Guest: George Corugedo
George Corugedo is Chief Technology Officer & Co-Founder at RedPoint Global Inc. A mathematician and seasoned technology executive, George has over 20 years of business and technical expertise. As co-founder and CTO of RedPoint Global, George is responsible for leading the development of the RedPoint Convergent Marketing Platform™. A former math professor, George left academia to co-found Accenture’s Customer Insight Practice, which specialized in strategic data utilization, analytics and customer strategy. Previous positions include director of client delivery at ClarityBlue, Inc., a provider of hosted customer intelligence solutions to enterprise commercial entities, and COO/CIO of Riscuity, a receivables management company specializing in the utilization of analytics to drive collections.
11. Overview – Challenges to Adoption
Skills Gap
• Severe shortage of MR skilled resources
• Very expensive resources and hard to retain
• Inconsistent skills lead to inconsistent results
• Underutilizes existing resources
• Prevents broad leverage of investments across enterprise
Maturity & Governance
• A nascent technology ecosystem around Hadoop
• Emerging technologies only address narrow slivers of functionality
• New applications are not enterprise class
• Legacy applications have built short term capabilities
Data Into Information
• Data is not useful in its raw state, it must be turned into information
• Benefit of Hadoop is that same data can be used from many perspectives
• Analysts must now do the structuring of the data based on intended use of the data
© RedPoint Global Inc. 2014 Confidential
12. Key Points to Cover Today
• Broad functionality across data processing domains
• Validated ease of use, speed, match quality and party data superiority
• Hadoop 2.0/YARN certified – 1 of first 17 companies to do so
• Not a repackaging of Hadoop 1.0 functionality. RedPoint Data Management is a pure YARN application (1 of only 2 in the initial wave of certifications)
• Building a complex job in RPDM takes a fraction of the time that it takes to write the same job in MapReduce and requires none of the coding or Java skills.
• Big functional footprint without touching a line of code
• Design model consistent with data flow paradigm
• RPDM has a “Zero-Footprint” install in the Hadoop cluster
• The same interface and functionality is available for both structured and unstructured databases. Thus it is seamless to work across both from a user’s perspective.
• Data quality done completely within the cluster
13. Key features of RedPoint Data Management
ETL & ELT
• Profiling, reads/writes, transformations
• Single project for all jobs
Data Quality
• Cleanse data
• Parsing, correction
• Geo-spatial analysis
Integration & Matching
• Grouping
• Fuzzy match
Master Key Management
• Create keys
• Track changes
• Maintain matches over time
Web Services Integration
• Consume and publish
• HTTP/HTTPS protocols
• XML/JSON/SOAP formats
Process Automation & Operations
• Job scheduling, monitoring, notifications
• Central point of control
All functions can be used on both TRADITIONAL and BIG DATA
Creates clean, integrated, actionable data – quickly, reliably and at low cost
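The slides list "fuzzy match" as a feature but do not say how it is computed. As a generic illustration only, the sketch below scores two strings with Levenshtein edit distance, normalized to a similarity between 0 and 1. The class name, method names and the threshold passed by the caller are assumptions made for this example; none of it is RedPoint's API or algorithm.

```java
import java.util.Locale;

// Hypothetical illustration of fuzzy matching; not RedPoint's matching algorithm.
final class FuzzyMatchSketch {

    // Classic dynamic-programming Levenshtein edit distance, kept to two rows of memory.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1] after trimming and lower-casing; 1.0 means identical.
    static double similarity(String a, String b) {
        String x = a.trim().toLowerCase(Locale.ROOT);
        String y = b.trim().toLowerCase(Locale.ROOT);
        int longest = Math.max(x.length(), y.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) editDistance(x, y) / longest;
    }

    // Two values "match" when their similarity reaches the caller-supplied threshold.
    static boolean isMatch(String a, String b, double threshold) {
        return similarity(a, b) >= threshold;
    }
}
```

With an assumed threshold of 0.8, `FuzzyMatchSketch.isMatch("Jon Smith", "John Smith", 0.8)` treats the two spellings as the same party, while clearly different names fall below the threshold. Production matching typically adds parsing, standardization and field weighting on top of a raw string score.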
14. RedPoint Data Management on Hadoop
[Diagram labels: Parallel Section (UI); Partitioning AM / Tasks; Execution AM / Tasks; Data I/O; Key / Split Analysis; YARN; MapReduce]
15. RedPoint Functional Footprint
[Diagram labels:
Monitoring and Management Tools: AMBARI
DATA REFINEMENT: PIG, HIVE, MAPREDUCE
STRUCTURE: HCATALOG (metadata services)
INTERACTIVE: HIVE Server2
LOAD: SQOOP, WebHDFS, Flume
LOAD: SQOOP/Hive, WebHDFS
Access: REST, HTTP, STREAM, NFS, JMS Queues, DBs, Files
SOURCE DATA: Sensor Logs, Clickstream, Flat Files, Unstructured, Sentiment, Customer, Inventory
Data Sources: RDBMS, EDW
Query/Visualization/Reporting/Analytical Tools and Apps
YARN
HDFS (nodes 1 … n)]
16. RedPoint Benchmarks – Project Gutenberg

Sample MapReduce (small subset of the entire code which totals nearly 150 lines):

public static class MapClass extends Mapper<WordOffset, Text, Text, IntWritable> {
    // Characters treated as word separators by the tokenizer.
    private final static String delimiters =
        "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ |«»¡¢£¤¥¦©¬®¯±¶·¿";
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emits (word, 1) for every token in the input line.
    public void map(WordOffset key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line, delimiters);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

Sample Pig script without the UDF:

SET pig.maxCombinedSplitSize 67108864
SET pig.splitCombination true
A = LOAD '/testdata/pg/*/*/*';
B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
C = FOREACH B GENERATE UPPER(word) AS word;
D = GROUP C BY word;
E = FOREACH D GENERATE COUNT(C) AS occurrences, group;
F = ORDER E BY occurrences DESC;
STORE F INTO '/user/cleonardi/pg/pig-count';

             | MapReduce                     | Pig                                                     | RedPoint
Code         | >150 Lines of MR Code         | ~50 Lines of Script Code                                | 0 Lines of Code
Development  | 6 hours of development        | 3 hours of development                                  | 15 min. of development
Runtime      | 6 minutes runtime             | 15 minutes runtime                                      | 3 minutes runtime
Notes        | Extensive optimization needed | User Defined Functions required prior to running script | No tuning or optimization required
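For readers who want to see what the benchmark job computes without a cluster, here is a minimal single-process sketch of the same word count in plain Java. It tokenizes each line, upper-cases the tokens (as the Pig script does with UPPER) and orders the results by descending count. The class name, method names and the shortened delimiter set are assumptions for illustration; this is not the benchmark code and says nothing about its performance.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;

// Single-process sketch of the word count that the benchmark jobs compute at cluster scale.
final class WordCountSketch {

    // Assumed delimiter set: whitespace plus common punctuation (shorter than the slide's set).
    private static final String DELIMITERS = " \t\n\r',./<>?;:\"[]{}-=_+()&*%^#$!@`~|";

    // Counts upper-cased tokens across all input lines.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line, DELIMITERS);
            while (itr.hasMoreTokens()) {
                // Upper-casing mirrors the UPPER(word) step of the Pig script.
                counts.merge(itr.nextToken().toUpperCase(Locale.ROOT), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Orders entries by descending count, like ORDER E BY occurrences DESC.
    static List<Map.Entry<String, Integer>> byOccurrencesDesc(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()));
        return entries;
    }
}
```

Calling `WordCountSketch.count(...)` on a few lines of text and passing the result to `byOccurrencesDesc` yields the most frequent word first. The MapReduce and Pig versions distribute the same tokenize, group and count steps across the cluster.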
17. Attributes of Information
RELEVANT: Information must pertain to a specific problem. General data must be connected to reveal relevance of the information.
COMPLETE: Partial information is often worse than no information. Partial information frequently leads to worse conclusions than if no data had been used at all.
ACCURATE: This one is obvious. In a context like health care, inaccurate data can be fatal. Precision is required across all applications of information.
CURRENT: As data ages, it becomes less accurate. Multiple research studies by Google and others show the decay in the accuracy of analytics as data becomes stale.
ECONOMICAL: There has to be a clear cost benefit. This requires work to identify the realizable benefit of information, but this is also what drives the use if successful.
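Two of these attributes, COMPLETE and CURRENT, lend themselves to simple automated checks. The sketch below is one hedged way to measure them: completeness as the fraction of records whose required fields are all filled in, and currency as whether a record was updated within a maximum age. The class, the field names used by callers and the age limit are all hypothetical choices for this example, not something the slides prescribe.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;

// Hypothetical checks for two of the attributes: COMPLETE and CURRENT.
final class InfoAttributeChecks {

    // Fraction of records in which every required field is present and non-blank.
    static double completeness(List<Map<String, String>> records, List<String> requiredFields) {
        if (records.isEmpty()) {
            return 0.0;
        }
        long complete = records.stream()
            .filter(r -> requiredFields.stream().allMatch(f -> {
                String v = r.get(f);
                return v != null && !v.trim().isEmpty();
            }))
            .count();
        return (double) complete / records.size();
    }

    // True when the record was last updated no longer than maxAge before 'now'.
    static boolean isCurrent(Instant lastUpdated, Instant now, Duration maxAge) {
        return Duration.between(lastUpdated, now).compareTo(maxAge) <= 0;
    }
}
```

A profiling job could report `completeness(...)` per source and flag records where `isCurrent(...)` is false, leaving the choice of required fields and acceptable age to the business owner of the data.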
18. Reference Architecture for Matching in Hadoop
[Diagram labels, Data Sources: CRM, ERP, Billing, Subscriber, Product, Network, Weather, Compete, Manuf., Clickstream, Online Chat, Sensor Data, Social Media, Call Detail Records, Fabrication Logs, Sales Feedback, Field Feedback]
19. RedPoint DM for Hadoop: Processing Flow
[Diagram labels: Data Management Designer; DM Execution Server (Parallel Section); Resource Manager; Node Managers hosting the DM App Master and DM Tasks. Numbered steps 1, 2, 3 with arrows labeled "Launches DM App Master", "Launches Tasks" and "Running DM Task".]
23. Who Should Care
• Companies interested in exploring the promise of Big Data Analytics that need an easy way to get started.
• Companies already investing heavily in Big Data Analytics technologies but stuck due to the shortage of skilled resources
• Large organizations that are focused on “Operational Offloading” and need to achieve it cost effectively
• Companies that recognize that much of the data that lands in Hadoop is external to the organization and need to have Data Quality and proper data governance applied to their Hadoop data.
24. RedPoint Convergent Marketing Ecosystem
Data Inputs
No SQL Social SQL Enhancement
Mobile Social Digital
RedPoint Interaction
Segmentation Inbox Analysis Attribution
GIS
Marketing Rules Engine
CRM Trigger Audience Offer
RedPoint Data Management
Machine Learning Analytics
Email
Address Std.
Web Services
Geocoding
Real Time
Cache
Marketing Operations Analytics Hadoop
25. RedPoint real-time decisions: how it works (web site example)
[Diagram labels:
• web site: personalization opportunity (content needed); API call; personalized content returned via API
• REDPOINT EXECUTION ENVIRONMENT: profile data; context data; real-time profile; Machine Learning rules; winning content
• content: candidate content with associated eligibility & scoring rules; content stored in RedPoint, or RedPoint points to content in CMS or other system
• RedPoint update/maintain over time: inbound personalizations combined with outbound contacts to create cross-channel interaction history
• 1st Party Customer data in database(s) and/or Hadoop]
26. RedPoint vs. alternatives
Each row is marked ✓ in one column and ✗ in the other:
• Pure YARN, no MapReduce
• Graphical UI, not code-based
• Top rated for ease-of-use
• All DQ/DI functions available
• Executes in Hadoop, no data movement
• Zero footprint install, nothing in the cluster
• Same product for Hadoop and database
28. Data Quality in the Hadoop Age
Solve your business puzzles with Intelligent Solutions
By Claudia Imhoff, PhD
Intelligent Solutions, Inc.
Boulder BI Brain Trust
Claudia@BBBT.US
SPONSORED BY HOSTED BY
Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved
29. Claudia Imhoff
President and Founder
Intelligent Solutions, Inc.
A thought leader, visionary, and practitioner,
Claudia Imhoff, Ph.D., is an internationally
recognized expert on analytics, business
intelligence, and the architectures to support
these initiatives. Dr. Imhoff has co-authored five
books on these subjects and writes articles
(totaling more than 150) for technical and
business magazines.
She is also the Founder of the Boulder BI Brain
Trust (BBBT), an international consortium of
independent analysts and experts. You can
follow them on Twitter at #BBBT or become a
subscriber at www.bbbt.us. Email: claudia@bbbt.us
Phone: 303-444-6650
Twitter: Claudia_Imhoff
30. Agenda
§ Extending the Data Warehouse Architecture
§ Things to Ponder…
31. Next Generation BI
[Diagram: "Next generation BI" with DRIVERS and TECHNOLOGIES groupings; labels: New business insights, Reduced costs, New technologies, Enhanced data management, Advanced analytics, New deployment options]
Based on a concept by Shree Dandekar of Dell
Slide compliments of Colin White – BI Research, Inc.
32. Systems of Record
§ Remember – It all starts here!
§ Transactional systems generate most of the data used for all other
activities – operational processes, BI & analytical capabilities, etc.
§ The point here is a reminder:
§ Extend OLTP systems of record as a “key” source of data
§ Many companies do not (or can not) leverage data they already
have in their operational systems
Operational systems
RT BI services
Other internal & external
structured & multi-structured data
Real-time streaming data
33. Next Generation – Extended Data Warehouse Architecture (XDW)
[Diagram components:
Analytic tools & applications
RT analysis platform
Traditional EDW environment
Investigative computing platform
Data refinery
Data integration platform
Operational real-time environment
Other internal & external structured & multi-structured data
Real-time streaming data
Operational systems
RT BI services]
Slide created by Colin White – BI Research, Inc.
34. Use Case: Traditional EDW
Most BI environments today:
§ New technologies can be incorporated into the EDW environment to improve performance, efficiency & reduce costs
Use cases:
§ Production reporting (data quality sensitive)
§ Historical comparisons
§ Customer analysis (next best offer, segmentation, life-time value scores, churn analysis, etc.)
§ KPI calculations
§ Profitability analysis
§ Forecasting
[Diagram labels: Analytic tools & applications; Traditional EDW environment; Data integration platform; Operational systems; RT BI services; real-time models & rules]
35. Data Quality Needed
§ EDW is now the “production” analytical environment
§ Produces standard reports, comparisons, and analytics to be used
as final word on situations
§ Data must be integrated as much as possible
§ Data must be run through data quality grist mill
§ There must be a full audit trail from source to ultimate
report, analytic, etc.
36. Use Case: Data Refinery
§ Ingests raw detailed data in batch and/or real-time into managed data store (lake, hub, swamp, dump…)
§ Distills the data into useful business information and distributes the results to downstream systems
§ May also directly analyze some data
§ Employs low-cost hardware and software to enable large amounts of detailed data to be managed cost effectively
§ Requires (flexible) governance policies to manage data security, privacy, quality, archiving and destruction
[Diagram labels: Traditional EDW environment; Investigative computing platform; Data refinery; Data integration platform]
37. Data Quality Needed
§ This is not a data dumping ground!
§ It should be monitored and assessed as to the data integration and
quality needs
§ Just because you can store massive sets of data doesn’t
mean it can be ignored or assumed not to need governance
§ Nor does it mean that there is no need for a business case
for the massive amount of data
§ If analytic accuracy is at 99% using 45% of the data, why deal with
all of it?
§ But speed of integration and quality processing is also
important
38. Use Case: Investigative Computing
New technologies used here include:
§ Hadoop, in-memory computing, columnar storage, data compression, appliances, etc.
Use cases:
§ Data mining and predictive modeling for EDW and real-time environments
§ Cause and effect analysis
§ Data exploration (“Did this ever happen?” “How often?”)
§ Pattern analysis
§ General, unplanned investigations of data
[Diagram labels: Analytic tools & applications; Investigative computing platform; Data refinery; Data integration platform; RT analysis platform; Operational real-time environment; Operational systems; RT BI services]
39. Data Quality Needed
§ Much more experimental in nature – lots of queries with
null results
§ Analytics may be approximations
§ Data integration may be needed for some data, not for
other data
§ Data quality also varies in terms of what data must go
through DQ process
§ Difficulty is in determining what gets integrated and run
through data quality processing
40. Use Case: Real Time Operational Environment
Embedded or callable BI
services:
§ Real-time fraud detection
§ Real-time loan risk assessment
§ Optimizing online promotions
§ Location-based offers
§ Contact center optimization
§ Supply chain optimization
Real-time analysis engine:
§ Traffic flow optimization
§ Web event analysis
§ Natural resource exploration
analysis
§ Stock trading analysis
§ Risk analysis
§ Correlation of unrelated data
streams (e.g., weather effects on
product sales)
RT analysis platform
Operational real-time environment
Other internal & external
structured & multi-structured data
Real-time streaming data
Operational systems
RT BI services
41. Data Quality Needed
§ Because of operational nature, data must be as good as it
can possibly be
§ Data may or may not be integrated with other operational
systems’ data
§ False positives and negatives to models must be
reconciled as quickly as possible
§ But speed of integration and quality processing is of the
utmost importance!
42. All Components Must Work Together
Investigative
computing platform
Analytic tools & apps
analytic models
analyses
Data refinery
Traditional EDW
environment
Operational systems
existing
customer
data
next best
customer offer
3rd party data
location data
social data
feedback
RT analysis platform
call center dashboard
or web event stream
Slide created by Colin White – BI Research, Inc.
Other internal & external
structured & multi-structured data
Real-time streaming data
43. Agenda
§ Extending the Data Warehouse Architecture
§ Things to Ponder…
44. What Makes People Think These Have Gone Away?
§ Data Redundancy
§ Each system, application, and department in enterprise collects
own version of key business entities and attributes
§ Data Inconsistency
§ Enormous resources (time, money, and people) spent in
reconciliation because of fractured data
§ Business Inefficiency
§ Fractured data generates business inefficiency – low productivity,
inefficient supply chain management, customer dissatisfaction,
wasted marketing efforts
§ Business Change
§ Organizations are constantly changing and these disruptive events
cause a constant stream of changes to data
45. Data Quality Challenges
§ Cultural Hurdles
§ Generating business case and obtaining
executive backing and funding
§ Requires a phased approach to quality deployment
§ Overcoming political barriers
§ E.g., moving from enterprise view to LOB/parochial
view of quality, yet still agreeing on common
business definitions
46. Data Quality Challenges
§ Technology Challenges
§ Unusual sources of data
§ Creating a flexible data governance model
§ Supporting complex & constantly changing data
§ Providing a flexible data integration
infrastructure
§ Wild West mentality…
47. Data Governance and Data
Quality Are Changing
§ People using BI must “trust” the data
§ IT must work with the business to create certified data sets
§ Note: not all data must be certified but all data usage must be
documented and monitored
§ Governance still has an important role
§ Determine whether data used is “governed” (e.g., in a data
warehouse or MDM environment) or “ungoverned” (e.g., individual
spreadsheets, external source)
§ Difficulty is figuring out differences – hence the need to monitor
data usage
§ IT must have monitoring or oversight capability
Note: LOB IT or experienced information producers may
have to take on some previously traditional central IT roles
48. Questions
§ What are the biggest challenges for data quality in the
Hadoop age?
§ How do you justify the need for integration and quality
processing in the “age of hurry up and give me the data”?
§ Not all data needs to be cleaned up and integrated but
how do people determine what does and doesn’t?
§ What tips can you give us to help get the time, resources
and funding for DQ in the refinery?
§ Technologically speaking, what is different about the
Hadoop environment versus a traditional RDBMS one?
§ Who sponsors/is responsible for the data quality/
integration effort in the age of Hadoop?
50. Upcoming Topics
This Month: INNOVATIVE TECHNOLOGY
August: BIG DATA ECOSYSTEM
September: INTEGRATION
www.insideanalysis.com/webcasts/the-briefing-room
2014 Editorial Calendar at www.insideanalysis.com