1. TRILLIUM SOFTWARE 2013 CUSTOMER CONFERENCE
(Who’s Afraid of…) The Big Bad Data Wolf?
The Big Bad Data Challenge – Big Data & the Data Quality Imperative
Presented by: Nigel Turner, VP Information Management Strategy
5. What’s different about Big Data?
• New technologies which enable distributed & highly scalable MPP (Massively Parallel Processing), e.g.
– Apache Hadoop
– MapReduce
– NoSQL databases
• Strong emphasis on analytical approaches
– Emergence of “data science”
– Predictive Analytics
– Data Mining
• The “democratisation” of data
– Data made available to all (cf Cloud Computing)
– Business and not IT led BI
7. Parallel worlds… or are they (1)?
Shared with 100,000+ others and counting…
8. Parallel worlds… or are they (2)?
“I spend the vast majority of my time cleaning data systems…cleaning and preparing data sets makes everything I do better … it’s the highest value activity I do”
Josh Wills, Senior Director of Data Science, Cloudera
(From “Training a new generation of Data Scientists” – Cloudera video)
9. When Big Data & Data Quality worlds collide…
• Big Data will expose Data Quality shortcomings
• Poor Data Quality will undermine the value of Big Data investments
10. Big Data – building on solid foundations
BIG DATA / ANALYTICS
DATA QUALITY FOUNDATION
11. The 3Vs and the DQ challenge
• Exponential growth of data – predicted 40-60% per
annum
• 2.5 quintillion bytes of data are created every day
• 90% of all digital data created in the last two years
• Data generated more varied and complex than before:
– Text, Audio, Images, Machine Generated etc.
• Much of this data is semi-structured or unstructured
• Traditional IT techniques ill equipped to process &
analyse it
• Data often generated in real time
• Analysis and response needs to be rapid, often also
real time
• Traditional BI / DW environments cannot cope – new
approaches are needed
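To make the predicted 40-60% annual growth concrete, the short sketch below works out what such rates imply if they compound steadily. It is only an illustration: the function names are hypothetical, and constant compound growth is an assumption, not something the prediction itself guarantees.

```python
import math


def doubling_time_years(annual_growth_rate: float) -> float:
    """Years needed for a quantity to double at a constant compound annual rate."""
    return math.log(2) / math.log(1 + annual_growth_rate)


def growth_factor(annual_growth_rate: float, years: int) -> float:
    """Multiplier on today's volume after the given number of years."""
    return (1 + annual_growth_rate) ** years


# The 40% and 60% figures come from the slide; the 5-year horizon is an assumption.
for rate in (0.40, 0.60):
    print(
        f"{rate:.0%} per annum: doubles every {doubling_time_years(rate):.1f} years, "
        f"x{growth_factor(rate, 5):.1f} after 5 years"
    )
```

Under these assumptions, data volumes double roughly every 1.5 to 2.1 years and grow about five- to tenfold within five years.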
12. Big Data – Foundations of Success
• Identifying the right data to solve the business problem or opportunity
• The ability to integrate & match varied data from multiple data sources
– structured, semi-structured, unstructured
• Building the right IT infrastructure to support Big Data applications
• Having the right capabilities & skills to exploit the data
13. Big Data – some vertical applications
• Retail: using point of sale & social media data to supplement & enrich traditional CRM / Marketing data
• Insurance & Banking: fraud detection
• Health: holistic patient analysis
• Utilities: consumption peaks & troughs & capacity planning
• Telcos: call routing optimisation & customer churn
• Manufacturing: predictive fault identification & supply chain optimisation
• Research: particle analysis, genomics etc.
14. Example Big Data benefit: The Open Big Data Cloud
SOURCE: LINKED OPEN DATA (LOD) COMMUNITY
15. Big Data in practice - Volvo
• Every Volvo vehicle has hundreds of microprocessors / sensors
• The data generated is used within the car itself but is also captured for analysis by Volvo and its dealers
• All data is loaded into a centralised analysis hub & integrated with CRM, dealership, product & social network data
• Used to optimise design & manufacturing, enhance customer interaction, improve safety & act on customer feedback
16. Big Data – Barriers & Pitfalls
• The sheer volume of data – what’s worth using?
• Data extraction challenges
• The ability to match data from disparate sources / formats / media
• The time taken to integrate new data sources
• The risks of mismatching and incorrect identification of individuals
• Legal & regulatory pitfalls
• Security concerns – corporate & individual
• Lack of skills & expertise
17. Big Data – the data integration challenge
EXTERNAL DATA SOURCES: Social Media, Sensors, Open Data, Email, Mobiles
INTERNAL DATA SOURCES: CRM, Billing, Ops, Sales, Prods
ANALYTICS PLATFORMS 1, 2, 3 … n
ACTIONABLE INSIGHT & KNOWLEDGE
18. Big Data – the Data Quality Imperative (1)
• Need to profile external and internal data sources
• Need to classify data to define what data really matters
• Need to assure the quality of internal (and some external) data sources for accuracy, completeness, consistency
• Need to define & apply business rules & metadata management to how the data will be defined and used
• Need for a data governance framework to ensure consistency & control
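As a rough illustration of what profiling a source for completeness and consistency can involve, the sketch below computes per-field fill rates and distinct-value counts. The sample records, field names and the `profile` helper are all hypothetical, invented for this example; they do not represent any particular profiling tool.

```python
from collections import Counter

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"customer_id": "C001", "email": "ann@example.com", "country": "UK"},
    {"customer_id": "C002", "email": "", "country": "United Kingdom"},
    {"customer_id": "C003", "email": None, "country": "uk"},
    {"customer_id": "C001", "email": "ann@example.com", "country": "UK"},
]


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or str(value).strip() == ""


def profile(rows):
    """Return completeness, distinct count and most common values per field."""
    fields = sorted({key for row in rows for key in row})
    report = {}
    for field in fields:
        present = [row.get(field) for row in rows if not is_missing(row.get(field))]
        report[field] = {
            "completeness": len(present) / len(rows) if rows else 0.0,
            "distinct": len(set(present)),
            "top_values": Counter(present).most_common(3),
        }
    return report


report = profile(records)
for field, stats in report.items():
    print(field, stats)
```

In this made-up sample the profile would flag that half the email values are missing, that one customer_id is repeated, and that the same country is written three different ways, which are the kinds of completeness and consistency issues the bullets above refer to.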
19. Big Data – the Data Quality Imperative (2)
• Need processes & tools to enable:
– Source data profiling
– Data integration
– Data parsing
– Data standardisation
– Business rule creation & management
– Metadata management & a shared business / IT glossary
– Data de-duplication
– Data normalisation
– Data matching
– Data enrichment
– Data audit
• Many of these functions must be capable of being carried out in real time with zero lag
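A few of the functions listed above (standardisation, matching and de-duplication) can be sketched in a handful of lines. This is a minimal illustration, not how any commercial data quality platform works: the synonym table, the record layout and the 0.85 similarity threshold are all assumptions chosen for the example, and the fuzzy comparison uses Python's standard `difflib.SequenceMatcher`.

```python
import re
from difflib import SequenceMatcher

# Hypothetical lookup table used to standardise country values.
COUNTRY_SYNONYMS = {"uk": "GB", "united kingdom": "GB", "gb": "GB"}


def standardise(record):
    """Normalise whitespace and case, and map country variants to one code."""
    name = re.sub(r"\s+", " ", record["name"]).strip().lower()
    country = record["country"].strip().lower()
    return {"name": name, "country": COUNTRY_SYNONYMS.get(country, country.upper())}


def is_match(a, b, threshold=0.85):
    """Fuzzy match on name within the same country; the threshold is an assumption."""
    if a["country"] != b["country"]:
        return False
    return SequenceMatcher(None, a["name"], b["name"]).ratio() >= threshold


def deduplicate(records):
    """Keep the first record of each group of matching records."""
    kept = []
    for record in map(standardise, records):
        if not any(is_match(record, existing) for existing in kept):
            kept.append(record)
    return kept


raw = [
    {"name": "Jon  Smith", "country": "UK"},
    {"name": "john smith", "country": "United Kingdom"},
    {"name": "Jane Doe", "country": "uk"},
]
unique = deduplicate(raw)
for record in unique:
    print(record)
```

The threshold is where the risk of mismatching shows up: set it too low and different individuals are merged, set it too high and duplicates survive. Pairwise comparison like this also does not scale to large volumes; it is shown only to make the steps concrete.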
20. Big Data – DQ as the key enabler
EXTERNAL DATA SOURCES: Social Media, Sensors, Open Data, Email, Mobiles
INTERNAL DATA SOURCES: CRM, Billing, Ops, Sales, Prods
DATA QUALITY PLATFORM: Profile, Parse, Standardise, Match, Enrich
ANALYTICS PLATFORMS 1, 2, 3 … n
ACTIONABLE INSIGHT & KNOWLEDGE
21. Big Data – some algorithms
1. BIG DATA + POOR DATA QUALITY = BIG PROBLEMS
2. DATA DEMOCRATISATION – DATA GOVERNANCE = ANARCHY
3. DATA MASH UPS – DATA QUALITY = DATA MESS
4. BIG DATA ANALYTICS + POOR DQ = WRONG RESULTS
5. BIG DATA – DATA ASSURANCE = JAIL
6. 3V + DATA QUALITY = 4V (VALIDITY)
22. Big Data & Data Quality – summary
• Big Data will depend on data quality to reap its claimed benefits – the GIGO truism
• The democratisation of data will expose poor DQ
• The need for Data Governance increases as data becomes more accessible
• Data skills will become more valued for ‘data science’
• Big Data will increase the 3Vs of data
• Control of data becomes more difficult – scope and variety of use increases
• Data standards & business rules become more complex
• Potential legal & regulatory minefield
23. What action should we take as data management / DQ professionals?
• Identify and get involved in any current or planned Big Data initiatives within our organisations
• Ensure that the Data Quality and Data Governance implications & imperatives of these initiatives are understood
• Plan for the new Data Quality and Data Governance challenges that these trends will pose