TDWI NYC Chapter - Tony Baer Ovum on Big data, Data quality, and BI Convergence
1. Big Data and Business Intelligence
Must Converge
Tony Baer
tony.baer@ovum.com
March 6, 2013
1 © Copyright Ovum. All rights reserved. Ovum is a subsidiary of Informa plc.
2. Agenda
Challenges traditional data stewardship practice
Privacy – is all the world a stage?
Limits to data lifecycle?
Data quality: the big, the bad, the ugly – and it all might be good!
3. Data stewardship challenges –
What’s old is new
Remember?
Back to undifferentiated ‘gobblobs’ of data
Programmatic access reigns
File systems, not (always) tables
Batch is back
But…
Volume, variety, velocity, and where’s the value??
Just because you can, should you?
[Slide examples: two web server log samples and a parsing snippet]
10.102.8.152 - - [05/Nov/2003:00:19:54 -0500] "GET /inventory/index.jsp HTTP/1.1" 200 4028 "http://www.mycompany.com/index.jsp" "Mozilla/4.08 [en] (Win98; I ;Nav)"
192.168.114.201, -, 03/20/01, 7:55:20, W3SVC2, SALES1, 172.21.13.45, 4502, 163, 3223, 200, 0, GET, /DeptLogo.gif, -,
172.16.255.255, anonymous, 03/20/01, 23:58:11, MSFTPSVC, SALES1, 172.16.255.255, 60, 275, 0, 0,
if index(tempvalue,'?') then
    tempvalue=scan(tempvalue,1,'?');
else if index(tempvalue,'&')>1 then
    tempvalue=scan(tempvalue,1,'&');
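To make the "programmatic access" point concrete, here is a minimal Python sketch of what handling the slide's first log sample might look like. It assumes the line follows the common "combined" access log layout; the names COMBINED, parse_line, and strip_query are hypothetical, and strip_query is only a loose analogue of the slide's snippet, not a line-for-line port.

```python
import re

# Assumed layout: host ident user [time] "method path protocol" status size "referer" "agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return the named fields of one log line, or None if it does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None

def strip_query(value):
    """Keep the text before the first '?', or else before the first '&' (if not leading)."""
    if "?" in value:
        return value.split("?", 1)[0]
    if value.find("&") > 0:
        return value.split("&", 1)[0]
    return value

# The first log sample from the slide, joined onto one line.
SAMPLE = (
    '10.102.8.152 - - [05/Nov/2003:00:19:54 -0500] '
    '"GET /inventory/index.jsp HTTP/1.1" 200 4028 '
    '"http://www.mycompany.com/index.jsp" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)

fields = parse_line(SAMPLE)
```

Even for one well-known format, the structure lives in the parsing code rather than in a schema, which is the stewardship concern the slide raises.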
4. Data stewardship questions for Big Data
Can we, should we “control” this data?
Are there limits to how much we should know?
Can we just keep piling up data forever?
Can we cleanse terabytes of data?
Do we still need “good” data?
5. Agenda
Challenges traditional data stewardship practice
Privacy – is all the world a stage?
Limits to data lifecycle?
Data quality: the big, the bad, the ugly – and it all might be good!
6. Privacy –
the more things change…
“You have zero privacy
anyway…. Get over it”
-- Scott McNealy, 1999
Facebook does not actually
delete images… but instead
merely removes the links – a fix
“is in sight”
-- ZDNet, 2/6/12
Facebook agrees to 20 years of
federal privacy audits
-- NY Times, 11/29/11
7. What privacy?
Florida made $63m last
year by selling DMV
information (name, date
of birth, type of vehicle
driven) to companies like
LexisNexis & Shadow
Soft.
-- Terence Craig & Mary Ludloff
Privacy and Big Data
(O’Reilly Media, 2011)
8. Big Data privacy 101 –
Don’t be creepy
Governance problem first, technology second
Understand the relationship with your customers & business partners
Keep communications in context
Don’t catch your customers by surprise
The law still trying to catch up
How Companies Learn Your Secrets
“My daughter got this in the mail!” he said. “She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?”
-- NY Times 2/16/12
9. Agenda
Challenges traditional data stewardship practice
Privacy – is all the world a stage?
Limits to data lifecycle?
Data quality: the big, the bad, the ugly – and it all might be good!
10. Data lifecycle –
How long can this go on?
Google, Yahoo, Facebook, etc.
don’t deprecate web data
Hadoop designed for
economical scale-out
Moore’s Law, declining cost of
storage
Is Hadoop Archive the answer?
Is Hadoop the new tape?
Management & skills will be the limit
[Image: aerial view of Quincy, WA data centers]
11. Agenda
Challenges traditional data stewardship practice
Privacy – is all the world a stage?
Limits to data lifecycle?
Data quality: the big, the bad, the ugly – and it all might be good!
12. Data Quality & Hadoop –
Big Quality Questions
Can we cleanse terabytes of data?
Do we still need “good” data?
Are there new approaches to cleansing Big Data?
13. Framing the issue
“Garbage in, garbage out,” but DW forced the issue
Traditional approaches
Profiling, cleansing, MDM
DW vs. Hadoop data quality challenges
Known data sets & known criteria vs. vaguely known
Bounded vs. less bounded tasks
Limitations of MapReduce*
Cleansing & transformation within a single Map
operation;
Profiling & matching of unstructured data
Matching of data in operations without inter-process
communications
*Source: David Loshin, "Hadoop and Data Quality, Data Integration, Data Analysis" at
http://www.dataroundtable.com/?p=8841
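One way to read the MapReduce points above is that cleansing which looks at a single record fits inside a map step, while matching needs records to be brought together under a shared key first. The sketch below illustrates that contrast in plain Python under that assumption; it uses no Hadoop API, and the function names, record fields, and the choice of email as a match key are all hypothetical.

```python
from collections import defaultdict

def clean_record(record):
    """Record-local cleansing: needs nothing beyond the record itself."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "email": record["email"].strip().lower(),
    }

def map_phase(records):
    """Cleanse each record independently and emit it under a match key."""
    for record in records:
        cleaned = clean_record(record)
        yield cleaned["email"], cleaned

def reduce_phase(pairs):
    """Group records by key (standing in for the shuffle) and keep keys seen more than once."""
    groups = defaultdict(list)
    for key, record in pairs:
        groups[key].append(record)
    return {key: recs for key, recs in groups.items() if len(recs) > 1}

# Made-up records for illustration only.
RECORDS = [
    {"name": "  ada   lovelace", "email": "ADA@example.com "},
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "Alan Turing", "email": "alan@example.com"},
]

duplicates = reduce_phase(map_phase(RECORDS))
```

Exact-key grouping like this only finds duplicates that normalize to the same key; fuzzy matching across different keys is where the lack of inter-process communication becomes a constraint.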
14. Is data quality necessary for Hadoop?
The App
How mission-critical?
Regulatory compliance impacts?
What degree of business impact?
The Data
The 4V’s (volume, variety, velocity,
value) determine what approaches
to quality are feasible
15. Examples
Web ad placement optimization
Counter-party risk management
for capital markets
Customer sentiment analysis
Managing smart utility grids or
urban infrastructure
16. Bad data may be good
Sensory data
Outlier or drift?
Time to recalibrate devices?
Time to perform preventive
maintenance?
Are new/unaccounted environmental
factors skewing readings?
Human-readable data
Flawed concept of reality?
Flawed assumptions on data meaning?
Changes producing ‘new norm’
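The "outlier or drift?" question can be sketched as a simple rule: a reading far from the baseline is an isolated outlier unless the recent window as a whole has also moved, in which case it looks like drift. The code below is one possible illustration, assuming a median/MAD baseline; the function name, window size, and threshold k are hypothetical choices, and the readings are made up.

```python
from statistics import median

def classify(readings, baseline_n=20, window=5, k=5.0):
    """Label each reading after the baseline as 'ok', 'outlier', or 'drift'."""
    baseline = readings[:baseline_n]
    center = median(baseline)
    # Median absolute deviation of the baseline; tiny floor avoids a zero threshold.
    spread = median(abs(v - center) for v in baseline) or 1e-9
    threshold = k * spread
    labels = []
    for i in range(baseline_n, len(readings)):
        recent = readings[i - window + 1 : i + 1]
        if abs(median(recent) - center) > threshold:
            labels.append("drift")      # the whole recent window has shifted
        elif abs(readings[i] - center) > threshold:
            labels.append("outlier")    # only this reading is far off
        else:
            labels.append("ok")
    return labels

# Made-up sensor readings: a stable baseline, one spike, then a sustained shift.
READINGS = [10.0, 10.2] * 10 + [10.0, 25.0, 10.2, 10.0, 10.2, 12.0, 12.1, 12.0, 12.2, 12.1]

labels = classify(READINGS)
```

With this data the spike is labeled an outlier, and the first readings after the shift are also labeled outliers until enough of the window has moved to call it drift. That lag is the practical reason "bad" readings may be worth keeping rather than cleansing away.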
17. Big Data quality in Hadoop –
Emergent approaches
Crowdsourcing data –
Collect data far & wide from as many diverse sources as possible. Torrents of data
overcome the noise.
Comparative trend analysis of incoming streams to dynamically ID the norm or
sweet spot of “good” data
Apply data science to “correct the dots”
Don’t go record by record. Statistically analyze the data set in aggregate.
Iteratively analyze & re-analyze nature of data, keep analyzing outliers
Apply off-the-wall approaches
Enterprise Architectural approach
Semantic (domain) model-driven
Apply cleansing logic at run time
Critical for sensitive, regulatory-driven apps
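As a rough illustration of analyzing in aggregate rather than record by record, the sketch below takes several streams reporting the same quantity, uses the per-step median across sources as the dynamically identified "norm", and scores each source by how far it sits from that norm on average. The stream names, values, and function names are hypothetical.

```python
from statistics import mean, median

def consensus(streams):
    """Per-time-step median across all sources: a stand-in for the 'norm'."""
    return [median(step) for step in zip(*streams.values())]

def source_deviation(streams):
    """Mean absolute distance of each source from the cross-source consensus."""
    norm = consensus(streams)
    return {
        name: mean([abs(v - n) for v, n in zip(values, norm)])
        for name, values in streams.items()
    }

# Made-up streams; sensor_d reads consistently high.
STREAMS = {
    "sensor_a": [20.1, 20.3, 20.2, 20.4],
    "sensor_b": [20.0, 20.2, 20.3, 20.5],
    "sensor_c": [20.2, 20.4, 20.1, 20.3],
    "sensor_d": [23.9, 24.2, 24.0, 24.1],
}

scores = source_deviation(STREAMS)
suspect = max(scores, key=scores.get)
```

No individual record is corrected here; the aggregate simply identifies which source deserves a closer look, and adding more sources makes the median less sensitive to any single noisy one.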
18. Summary
Challenges traditional data stewardship practice
Combination of old & new
Privacy – is all the world a stage?
Best practices, legal requirements still in flux
Don’t be creepy!
Limits to data lifecycle?
Few enterprises are Google or Facebook
Ability to manage large infrastructure will be major limit
Data quality
Strategy depends on type of app & data set(s)
A spectrum of approaches -- from none to classic ETL to aggregate statistical analysis
No single silver bullet
19. Disclaimer
All Rights Reserved.
No part of this publication may be reproduced, stored in a retrieval system or
transmitted in any form by any means, electronic, mechanical, photocopying,
recording or otherwise, without the prior permission of the publisher, Ovum
(an Informa business).
The facts of this report are believed to be correct at the time of publication but
cannot be guaranteed. Please note that the findings, conclusions and
recommendations that Ovum delivers will be based on information gathered in
good faith from both primary and secondary sources, whose accuracy we are not
always in a position to guarantee. As such Ovum can accept no liability whatever
for actions taken based on any information that may subsequently prove to be
incorrect.