The key findings of the survey of 314 big data professionals are:
- 87% said 'bad data' pollutes their data stores and 74% said 'bad data' is currently in their stores. Ensuring data quality was the top challenge cited.
- 77% build data flows through hand coding while 53% change pipelines several times per month.
- Only 12% rated their ability to detect issues like stopped pipelines or degraded performance as 'good' or 'excellent'.
- There are significant gaps between the real-time visibility needed and what current tools provide across metrics like error rates, divergent data, and privacy detection.
- 81% said upgrading big data components has significant operational impact.
Sponsored by:
‘Bad Data’ Is Polluting Big Data
Enterprises Struggle with Real-Time Control of Data Flows
A Global Survey of Big Data Professionals
June 2016
Executive Summary
The big data market is still maturing, especially as it relates to
data in motion, as evidenced by the lack of best practices or
consistent processes to clean and manage data quality. For
companies that use big data to optimize current business
operations or to make strategic decisions, it is critical
that they ensure their big data teams have real-time
visibility and control over the data at all times.
This report finds that companies that leverage big data are rarely
capable of controlling their data flows. Almost 9 out of 10 companies
report ‘bad data’ polluting their data stores and, strikingly, nearly three
quarters indicate there is ‘bad data’ in their stores currently. The findings
also reveal a chasm between the problem detection capabilities data experts
have today and what they desire. This translates into a lack of real-time
visibility and control of data flows, operations, quality and security.
Key Findings
• 87% state ‘bad data’ pollutes their data stores while 74% state ‘bad data’ is
currently in their data stores
• Ensuring data quality was the most common challenge cited, by 68% of
respondents, and only 34% claimed to be good at detecting divergent data
• 77% responded that they hand code their data flows while 53% claimed they
have to change each pipeline at least several times a month
• Tremendous gaps exist between today’s big data flow management tools’
capabilities and what is needed
• Only 12% of respondents rated their performance as good or excellent across all 5
key data flow operational performance areas
• 72% desire a single pane of glass solution to manage all data flows
• 81% state there is a significant operational impact when they upgrade big data
components
Goals and Methodology
Research Goal
The primary research goal was to capture how
companies manage the flow of big data. The
research also investigated and documented current
tools’ capabilities, data quality, and efforts to maintain
big data pipelines and infrastructure.
Methodology
Big data professionals worldwide were invited to
participate in a survey on the topic of big data and
ensuring data flow operations and data quality.
The survey was administered electronically and
participants were offered a token compensation for
their participation.
Participants
A total of 314 participants who manage big data
operations completed the survey.
Companies Represented
Size
• 500 - 1,000: 25%
• 1,000 - 5,000: 29%
• 5,000 - 10,000: 16%
• More than 10,000: 30%
Industry
• Technology: 18%
• Financial Services: 18%
• Manufacturing: 12%
• Healthcare: 10%
• Education: 6%
• Services: 6%
• Government: 6%
• Telecommunications: 5%
• Energy and Utilities: 5%
• Transportation: 5%
• Retail: 4%
• Non-Profit: 1%
• Media and Advertising: 1%
• Hospitality and Entertainment: 1%
• Food and Beverage: 1%
• Other: 2%
Participant Demographics
Role
• IT staff responsible for implementing and operating data infrastructure (e.g. database…: 56%
• IT manager responsible for delivering data initiatives: 52%
• IT executive with data initiatives in my portfolio: 34%
• BI or Analytics Technology Owner (e.g. data architect, head of data platform): 17%
• Business stakeholder who uses data to make decisions: 8%
• Business analyst: 6%
Location
• United States or Canada: 75%
• Europe: 14%
• Mexico, Central America, or South America: 4%
• Australia or New Zealand: 3%
• Middle East or Africa: 2%
• Asia: 2%
Top 3 Challenges for Big Data Flows are
Quality, Security and Reliable Operation
What challenges does your company face when managing your big data flows?
• Ensuring the quality of the data (accuracy, completeness, consistency): 68%
• Complying with security and data privacy policies: 60%
• Keeping data flow pipelines operating effectively: 52%
• Building pipelines for getting data into the data store: 47%
• Upgrading big data infrastructure components (Kafka, Hadoop, etc.): 40%
• Adapting pipelines to meet new requirements: 32%
• We have no challenges: 1%
87% State ‘Bad Data’ Pollutes Their Data
Stores
Does ‘bad data’ occasionally get into your data stores?
• Yes: 87%
• No: 13%
74% State ‘Bad Data’ is Currently in Their
Data Stores
Do you believe there is any ‘bad data’ in your data stores currently?
• Yes: 74%
• No: 26%
77% of Companies Still Use Hand Coding to
Build Big Data Flows
How does your company build big data flow pipelines today?
• Coding with Python, Java, etc. or low-level frameworks such as Sqoop, Flume or Kafka: 77%
• Using ETL or data integration tools: 63%
• Using big data ingestion tools such as StreamSets, NiFi, etc.: 27%
53% Change Data Flow Pipelines At Least Several
Times a Month
On average, how often are changes or fixes made to a typical data flow pipeline?
• Several times a day: 3%
• Several times a week: 19%
• Several times a month: 31%
• Several times a quarter: 26%
• Several times a year: 12%
• Less often than several times a year: 8%
85% State Unexpected Structure and Semantic Changes
Have Substantial Impact on Dataflow Operations
When data structure or semantics unexpectedly change, how big is the impact on the
operation of your big data flows (failures, slowdowns, data corruption, etc.)?
• Significant impact: 31%
• Moderate impact: 54%
• Minor impact: 11%
• Structure and semantic changes have no effect on our big data flows: 2%
• Data structure and semantic changes never occur: 2%
More Than Half of Companies Lack Real-Time
Information About Data Flow Quality
How would you assess your ability to detect each of the following issues in real-time?
• A specific data flow pipeline has stopped operating: Excellent 16%, Good 46%, Average 29%, Poor 9%, None 1%
• Data flow throughput is degrading or latency is growing: Excellent 7%, Good 37%, Average 37%, Poor 17%, None 1%
• Error rates are increasing: Excellent 7%, Good 37%, Average 38%, Poor 16%, None 1%
• The values of incoming data are diverging from historical norms: Excellent 5%, Good 29%, Average 43%, Poor 20%, None 3%
• Personally identifiable information (credit card numbers, social security numbers) is being inappropriately placed in a data store: Excellent 18%, Good 33%, Average 30%, Poor 13%, None 6%
Only 12% Rated Their Performance as ‘Good’ or
‘Excellent’ Across All Five Key Data Flow Metrics
Five Key Data Flow Metrics
1. A specific data flow pipeline has stopped operating
2. Data flow throughput is degrading or latency is growing
3. Error rates are increasing
4. The values of incoming data are diverging from historical norms
5. Identify personal information within the data flows
[Chart: Number of Key Data Flow Metrics Participants Rated as ‘Good’ or ‘Excellent’, from 0 metrics to all 5 metrics. Segment values: 19%, 17%, 19%, 20%, 12%, 12%.]
Substantial Value In Real-Time Data Flow
Detection Capabilities
In your opinion, how valuable would it be to be able to detect each of these issues in real-time?
• A specific data flow pipeline has stopped operating: Very valuable 42%, Valuable 42%, Average value 14%, Limited value 3%
• Data flow throughput is degrading or latency is growing: Very valuable 28%, Valuable 49%, Average value 20%, Limited value 3%
• Error rates are increasing: Very valuable 33%, Valuable 46%, Average value 17%, Limited value 4%
• The values of incoming data are diverging from historical norms: Very valuable 23%, Valuable 46%, Average value 26%, Limited value 4%
• Identify personal information within the data flows: Very valuable 40%, Valuable 35%, Average value 18%, Limited value 6%
Gap Between Current Pipeline Real-Time
Visibility Capabilities and Stated Value
A specific data flow pipeline has stopped operating
• Assessed value: Very valuable 42%, Valuable 42%, Average value 14%, Limited value 3% (Very valuable or Valuable: 84%)
• Real-time ability: Excellent 16%, Good 46%, Average 29%, Poor 9% (Excellent or Good: 62%)
Chasm Between Today’s Data Flow
Throughput Metrics and What is Needed
Data flow throughput is degrading or latency is growing
• Assessed value: Very valuable 28%, Valuable 49%, Average value 20%, Limited value 3%, Not valuable 1% (Very valuable or Valuable: 77%)
• Real-time ability: Excellent 7%, Good 37%, Average 37%, Poor 17%, None 1% (Excellent or Good: 44%)
Significant Gap Between Error Rate
Visibility Value and Current Capabilities
Error rates are increasing
• Assessed value: Very valuable 33%, Valuable 46%, Average value 17%, Limited value 4% (Very valuable or Valuable: 79%)
• Real-time ability: Excellent 7%, Good 37%, Average 38%, Poor 16% (Excellent or Good: 44%)
Chasm Between Value of Detecting
Divergent Data and Current Capabilities
The values of incoming data are diverging from historical norms
• Assessed value: Very valuable 23%, Valuable 46%, Average value 26%, Limited value 4%, Not valuable 1% (Very valuable or Valuable: 69%)
• Real-time ability: Excellent 5%, Good 29%, Average 43%, Poor 20%, None 3% (Excellent or Good: 34%)
Large Gap Between Data Privacy Value and
Current Capabilities
Identify personal information within the data flows
• Assessed value: Very valuable 40%, Valuable 35%, Average value 18%, Limited value 6%, Not valuable 2% (Very valuable or Valuable: 75%)
• Real-time ability: Excellent 18%, Good 33%, Average 30%, Poor 13%, None 6% (Excellent or Good: 51%)
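The five gap slides above each compare two figures: the share of respondents who rate real-time detection as valuable or very valuable, and the share who rate their current ability as good or excellent. As a rough sketch of that comparison, the Python below subtracts one from the other for each issue. The percentages are transcribed from the slides; the dictionary, the shortened issue labels, and the `gap_points` helper are illustrative names, not part of the report.

```python
# Percentages transcribed from the five gap slides.
# Each entry: (share rating detection "Very valuable" or "Valuable",
#              share rating current ability "Excellent" or "Good").
# Issue labels are shortened here for readability (an editorial assumption).
top_two = {
    "Pipeline has stopped operating": (84, 62),
    "Throughput degrading or latency growing": (77, 44),
    "Error rates increasing": (79, 44),
    "Incoming values diverging from historical norms": (69, 34),
    "Personal information in the data flows": (75, 51),
}


def gap_points(value_pct, ability_pct):
    """Gap in percentage points between stated value and current ability."""
    return value_pct - ability_pct


# Hypothetical helper output: one gap figure per issue.
gaps = {issue: gap_points(v, a) for issue, (v, a) in top_two.items()}

# List the issues from the widest gap to the narrowest.
for issue, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{issue}: {gap} points")
```

Read this way, the widest gaps are in error-rate and divergent-data detection, and the narrowest is in detecting a stopped pipeline.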
72% Desire A Single Pane of Glass Solution
To Manage All Data Flows
How valuable is it to have a single control panel for comprehensive visibility and
management across all of your data flows?
• Very valuable: 24%
• Valuable: 48%
• Average value: 24%
• Limited value: 4%
50% State that Data Cleansing at the Source
is the Most Effective Quality Practice
Which of the following do you consider to be the most effective approach to ensuring data quality?
• Cleanse data as it flows in from the source: 50%
• Cleanse and update data once it is in the store: 27%
• Data scientists or business analysts cleanse data before using it: 23%
81% State There is Significant Operational
Impact to Upgrading Big Data Components
What is the operational impact of upgrading big data components (ingest technologies,
message queues, data stores, search stores, etc.)?
• Heavy impact: 17%
• Moderate impact: 64%
• Minor impact: 17%
• No impact: 2%
For more information…
About Dimensional Research
Dimensional Research provides practical marketing research to help technology companies make
smarter business decisions. Our researchers are experts in technology and understand how
corporate IT organizations operate. Our qualitative research services deliver a clear
understanding of customer and market dynamics.
For more information, visit www.dimensionalresearch.com.
About StreamSets
Place holder
For more information, visit www.streamsets.com.
Tremendous Gaps Exist Between Current Big Data Flow
Management Tool Capabilities and What is Needed
Ability to Detect Area in Real-Time Compared Against Stated Value To Detect in Real-Time
Personally identifiable information (credit card numbers, social security numbers) is being inappropriately placed in a data store
• Stated Value: Very valuable 40%, Valuable 35%, Average value 18%, Limited value 6%, Not valuable 2%
• Current Ability: Excellent 18%, Good 33%, Average 30%, Poor 13%, None 6%
The values of incoming data are diverging from historical norms
• Stated Value: Very valuable 23%, Valuable 46%, Average value 26%, Limited value 4%, Not valuable 1%
• Current Ability: Excellent 5%, Good 29%, Average 43%, Poor 20%, None 3%
Error rates are increasing
• Stated Value: Very valuable 33%, Valuable 46%, Average value 17%, Limited value 4%, Not valuable 0%
• Current Ability: Excellent 7%, Good 37%, Average 38%, Poor 16%, None 1%
Data flow throughput is degrading or latency is growing
• Stated Value: Very valuable 28%, Valuable 49%, Average value 20%, Limited value 3%, Not valuable 1%
• Current Ability: Excellent 7%, Good 37%, Average 37%, Poor 17%, None 1%
A specific data flow pipeline has stopped operating
• Stated Value: Very valuable 42%, Valuable 42%, Average value 14%, Limited value 3%
• Current Ability: Excellent 16%, Good 46%, Average 29%, Poor 9%, None 1%
Various Approaches To Managing Data
Quality Indicate a Lack of Best Practice
Which of the following approaches for ensuring data quality does your company utilize?
• Cleanse and update data once it is in the store: 55%
• Cleanse data as it flows in from the source: 54%
• Data scientists or business analysts cleanse data before using it: 43%
Many Must Perform Maintenance and
Troubleshooting on Data Flows Routinely
Approximately, what percentage of data flow changes and fixes are made for day-to-day
maintenance and troubleshooting purposes?
• More than 80%: 3%
• 60% - 80%: 10%
• 40% - 60%: 24%
• 20% - 40%: 27%
• Less than 20%: 36%