Leaving Data on
the Table
Data Scientists Reveal Obstacles
to Big Data Analytics
Paradigm4 Data Scientist Survey 2
While Big Data enjoys widespread media coverage, not enough attention has been paid to what
practitioners think — data scientists who manage and analyze massive volumes of data.
We wanted to know, so Paradigm4 teamed up with Innovation Enterprise to ask over 100 data scientists
for their help separating Big Data hype from reality. What we learned is that data scientists face multiple
challenges achieving their company’s analytical aspirations. The upshot is that businesses are leaving data
— and money — on the table.
We asked data scientists questions such as:
What obstacles prevent them from gaining insights into their data?
How many use Hadoop, and which limitations have they encountered
when attempting to use Hadoop for complex analytics?
What data types and sources would they like to leverage more effectively?
Whether they’ll adopt complex analytics solutions (see below),
and how quickly?
Their experiences should help inform businesses on what to look for as they investigate options to expand
their analytics infrastructure.
For insight on the issues and obstacles facing data scientists, read on.
This survey uses the terms “complex analytics” and “basic analytics,” for which respondents were given these definitions:
“Complex analytics” means math functions like covariance, clustering, machine learning, principal components
analysis and graph operations.
“Basic analytics” means business intelligence reporting such as sums, counts and aggregates.
This distinction is important because basic analytics are “embarrassingly parallel” whereas complex analytics
are not. Here’s what we mean. “Embarrassingly parallel” (sometimes referred to as “data parallel”) refers to problems that
can be separated into multiple independent sub-problems that can run in parallel and do not require access to all the data
at once. This is the divide-and-conquer approach used by MapReduce/Hadoop. In contrast, “non-embarrassingly parallel”
problems require using and sharing all the data at once and communicating intermediate results among processes.
Matrix multiplication on matrices too large to fit on one server is an example of a non-embarrassingly parallel function.
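The difference between embarrassingly parallel and non-embarrassingly parallel problems described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names (chunked, parallel_sum, blocked_matmul) are hypothetical, everything runs on one machine, and the blocked loop merely mimics the data dependencies a distributed matrix multiplication would have.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, n_chunks):
    """Split a list into contiguous pieces (hypothetical helper)."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_sum(values, n_chunks=4):
    """Embarrassingly parallel: each worker sums its own chunk without
    seeing any other chunk; only the small partial sums are combined."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(sum, chunked(values, n_chunks)))
    return sum(partials)


def blocked_matmul(a, b, block=2):
    """Not embarrassingly parallel across data partitions: every output
    block of C needs a full block-row of A and a full block-column of B,
    so workers that each hold only part of A and B would have to exchange
    data and accumulate intermediate results."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for k0 in range(0, k, block):  # partial products accumulate here
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, m)):
                        c[i][j] += sum(
                            a[i][x] * b[x][j]
                            for x in range(k0, min(k0 + block, k))
                        )
    return c


total = parallel_sum(list(range(101)))
product = blocked_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

In the sum, the chunks never need to see each other. In the multiplication, the innermost accumulation over k0 is exactly the step that would require communication if the blocks lived on different servers.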
The Big Takeaways
1. We’ve all heard how hard it is to analyze massive and rapidly growing data volumes. But
data scientists say variety presents a bigger challenge. They are at times leaving data out
of their analyses as they wrestle with how to integrate and analyze more types of data such
as time-stamped sensor, location, image and behavioral data as well as network data.
2. Data scientists are turning to large-scale complex analytics both for unbiased
data-driven exploration and to wrest more value from their data.
3. For complex analytics, data scientists are forced to move large volumes of data
from existing data stores to dedicated mathematical and statistical computing
software. This time-consuming and coding-intensive step adds no analytical value
and impedes productivity.
4. While Hadoop has garnered widespread media coverage, 76 percent of the data
scientists who have used it have encountered serious limitations. Hadoop is well suited for
embarrassingly parallel problems but falls short for large-scale complex analytics.
5. Incorporating diverse data types into analytical workflows is a major pain point
for data scientists using traditional relational database software.
6. For data scientists, Big Data means Big Stress: 39 percent say it has made their job
more stressful.
Data Variety Is Proving to Be More Important Than Volume
The overwhelming volume of corporate and organizational data continues to generate headlines but it’s the
diverse types of data that pose a bigger challenge. Nearly three-quarters of data scientists (71 percent)
said Big Data had made their analytics more difficult and that data variety, not just volume, was the challenge.
My Analytics Are Becoming More Difficult Because of the Variety and Types of Data Sources (Not Just the Volume)
TRUE: 71%
FALSE: 29%
What Is the Biggest Problem You Face in Gaining Insights From Your Big Data?
I struggle with managing new types and sources of data: 40%
I know how to get the answer but it takes too long (my data is too big to move to a math/analytics software package): 36%
I don’t know what questions to ask of my data: 24%
I know what I want to ask but don’t know how to get the answers: 18%
I know how to get the answer but my analysis runs out of memory: 17%
Which types of data do you anticipate using in the next year?
Time-series: 66%
Business transaction: 66%
Geospatial / Location: 55%
Graph (network): 46%
Clickstream: 35%
Health records: 25%
Sensor: 17%
Image: 13%
Genomic: 7%
What It Means:
The ability to effectively use diverse data sources is proving to
be a competitive differentiator in many industries.
The trend toward hyper-personalization and precision targeting illustrates this well.
Recommendations, search results and ads are becoming ever more relevant and micro-targeted
as they tap more and diverse data like social networks, current location, and browsing and
purchasing history. Personalized insurance offerings are augmenting sensor data about driver
behavior to incorporate contextual data like time-of-day and road congestion. Precision medicine
providers are gaining a more refined understanding of what works for whom by integrating
molecular data with clinical, behavioral, electronic health records and environmental data. But
the ability to use diverse data types poses a serious challenge. (For more on this topic, see “Big
Data at Work: Dispelling the Myths, Uncovering the Opportunities,” by Thomas Davenport,
Chapter 1: “Why Big Data is Important to you and your Organization.”)
Data Scientists Are Turning to Complex
Analytics to Analyze Their Big Data
When will your company begin to use complex analytics on your Big Data?
We use it now: 59%
We plan to use it in the next year: 15%
In the next 2 years: 16%
In the next 3 years: 1%
More than 3 years down the road: 4%
No plans to use complex analytics: 4%
“The point is not to be dazzled by the volume of data,
but rather to analyze it, to convert it into insights,
innovations, and business value.”
Thomas Davenport, “Big Data at Work: Dispelling
the Myths, Uncovering the Opportunities,” page 2.
What It Means:
The “low hanging fruit” of Big Data has been exploited.
Many new analytical uses require significantly more powerful algorithms and computational
approaches than what’s possible in Hadoop or relational databases. Data scientists increasingly
need to leverage all data sources in novel ways, using tools and analytical infrastructures suitable
for the task. As we have already seen in this survey, organizations are moving from simple SQL
aggregates and summary statistics to next-generation analytics such as machine learning,
clustering, correlation, and principal components analysis on moderately sized data sets. The
move from simple to complex analytics on Big Data presages an emerging need for analytics
that scale beyond single server memory limits and handle sparsity, missing values and mixed
sampling frequencies appropriately. These complex analytics methods can also provide data
scientists with unsupervised and assumption-free approaches, letting all the data speak for itself.
Moving Big Data Poses Difficult
Challenges to Data Scientists
Data scientists face another growing challenge: conventional analytic workflows require them to move data
to mathematical and statistical computing software. This workflow made sense with small or sampled data
but is either woefully inefficient or breaks with even moderately large data volumes.
78% of data scientists utilize software capable of
complex analytics in addition to their data
management software
36% of data scientists say it takes too long to get
insights from their data because it is too
big to move to their analytics software
What It Means:
The size and diversity of today’s data sets pose a significant hurdle
to doing more sophisticated analytics because so much time is lost
moving data from files or from a database to analysis tools.
This forces data scientists to make compromises, analyzing samples instead of the whole
data set, leaving data and money on the table. Data scientists risk missing rare events, weak
signals or important anomalies when restricted to working with samples or computing on
subsets independently. (For more on this topic, see “Scaling Big Data Mining Infrastructure:
The Twitter Experience,” by Twitter Engineering Manager Dmitriy Ryaboy and University of
Maryland Associate Professor Jimmy Lin.) What’s needed are tools capable of conducting
complex analytics over massive data volumes efficiently, without sampling and without
moving the data.
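To make the risk of missing rare events when working with samples more concrete, here is a minimal Python sketch. It assumes a simple random sample drawn without replacement, and every number in it is hypothetical, chosen only for illustration (it is not survey data). The function name prob_sample_misses_all is likewise an assumption.

```python
def prob_sample_misses_all(total, rare, sample):
    """Probability that a simple random sample of `sample` records,
    drawn without replacement from `total` records, contains none of
    the `rare` records of interest (hypergeometric with zero hits)."""
    if sample > total - rare:
        return 0.0  # the sample is too large to avoid every rare record
    p = 1.0
    for i in range(sample):
        # chance that the (i+1)-th draw is also a non-rare record
        p *= (total - rare - i) / (total - i)
    return p


# Hypothetical scenario: 10 anomalous records hidden in 1,000,000,
# analyzed through a 1 percent sample (10,000 records).
p_miss = prob_sample_misses_all(total=1_000_000, rare=10, sample=10_000)
print(f"{p_miss:.2f}")
```

Under these assumed numbers the sample contains none of the ten rare records roughly nine times out of ten, which is the kind of weak signal that analysis on the whole data set would not lose.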
Hadoop Only Takes You So Far
While the Hadoop software platform garners significant media attention, Hadoop is not a viable solution
for many use cases, especially those that require complex analytics. Fewer than half of data scientists
surveyed (48 percent) have used Hadoop or Spark, and of those, 76 percent cited significant limitations
to its use.
Among the 76% reporting problems, what are the limitations of Hadoop / Spark?
It takes too much effort to program: 39%
It’s too slow for interactive, ad-hoc queries: 37%
It’s too slow for real-time analytics: 30%
It’s not well-suited for my analytics (not embarrassingly parallel): 22%
35% of data scientists who tried Hadoop or
Spark have stopped using it
What It Means:
Hadoop was unrealistically hyped as a universal and
disruptive Big Data solution.
But even Hadoop vendors have recognized the limitations. They are adding SQL functionality to
their products to accommodate data scientists’ preference for a higher-level query language instead
of programming languages like Java and to address the limitations of MapReduce. (E.g., Cloudera
has abandoned MapReduce and is offering Impala to provide SQL on HDFS.) A growing number of
complex analytics use cases are proving to be unworkable in Hadoop. First-wave Hadoop adopters
like Google, Facebook and LinkedIn required a small army of developers to program and maintain
Hadoop. But many organizations either don’t have the required staff or face complex analytics
challenges that can’t be readily solved with Hadoop. This presents a real challenge for the Hadoop
infrastructure, which has to address these shortcomings or risk being replaced.
Existing relational database management systems are
inadequate for analyzing the variety of data sources
Given the growing diversification of data types and sources coupled with the limitations of existing relational
databases, it’s no surprise that many data scientists are frustrated when leveraging these data sources in their
analytical workflows.
I am finding it harder to fit my data into relational database tables
TRUE: 49%
FALSE: 51%
What It Means:
Relational databases were built for storing and querying densely
populated transactional data such as business purchases and
customer information.
By comparison, temporal, spatial and network data may be quite sparse (containing
large amounts of missing values), have mixed sampling frequencies and a natural order.
Relational databases require predefined access patterns for each line of inquiry, an obvious
non-starter for data scientists doing ad hoc data exploration.
Big Data Means Big Stress for Data Scientists
There’s another side of the Big Data story: 39 percent of data scientists say their job has become more
stressful with the growth of Big Data. That’s nearly four times the number who say it’s made their job
less stressful.
39% of data scientists say the growth of Big Data has made
their job more stressful in the last year
24% say they don’t know which questions to ask of their Big Data
Quotes from data scientists:
“My biggest problem is linking various data sources.”
“The data is just too big.”
“The biggest problem is putting
multiple sources of data together.”
What It Means:
Driven in part by media hype, organizations have developed
inflated expectations around the value they’ll get out of Big Data.
Fulfilling those expectations falls on the data scientist. But outdated software approaches
better suited to traditional transactional data (not today’s diverse data sources and rapidly
growing volumes) often make it impossible to fulfill these expectations. It’s a recipe for
stress. Deriving business value from organizational data starts with ad hoc analysis. Tools and
workflows need to enable data scientists to conduct analysis quickly and efficiently, making
data scientists more productive and lowering stress levels as a result.
So What?
Data scientists play a pivotal role helping organizations unlock the potential of their Big Data. But
current software tools fall short in some areas as indicated in the survey. Hype has exceeded reality
and data scientists are forced to compromise, sometimes leaving data on the table. Choosing the
right software solution is key but don’t expect to get there by browsing vendors’ websites. The fact
that so many data scientists identified shortcomings in their infrastructure suggests that the only way
to tell which solution is best suited to your organization is to do a pilot project using your data and
your use cases.
About the Survey
The Paradigm4 Data Scientist Survey was fielded by Innovation Enterprise, an independent research
firm, from March 27 to April 23, 2014. The responses were generated from a survey of 111 data
scientists in the U.S.
About Paradigm4
Paradigm4 is the creator of SciDB, a computational database management system used to solve
large-scale, complex analytics challenges on Big (and Diverse) Data. Led by industry visionaries
and veterans Michael Stonebraker, Marilyn Matz, Paul Brown and Bryan Lewis, Paradigm4 enables
data-obsessed organizations in life sciences, e-commerce, finance, and manufacturing to answer
harder questions faster.
For more information, visit www.paradigm4.com

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 

Paradigm4 Research Report: Leaving Data on the Table

  • 1. Leaving Data on the Table: Data Scientists Reveal Obstacles to Big Data Analytics
  • 2. Paradigm4 Data Scientist Survey. While Big Data enjoys widespread media coverage, not enough attention has been paid to what practitioners think: the data scientists who manage and analyze massive volumes of data. We wanted to know, so Paradigm4 teamed up with Innovation Enterprise to ask over 100 data scientists for their help separating Big Data hype from reality. What we learned is that data scientists face multiple challenges achieving their company’s analytical aspirations. The upshot is that businesses are leaving data, and money, on the table. Their experiences should help inform businesses on what to look for as they investigate options to expand their analytics infrastructure. We asked data scientists questions such as: What obstacles prevent them from gaining insights into their data? How many use Hadoop, and which limitations have they encountered when attempting to use Hadoop for complex analytics? What data types and sources would they like to leverage more effectively? Will they adopt complex analytics solutions, and how quickly? For insight on the issues and obstacles facing data scientists, read on. This survey uses the terms “complex analytics” and “basic analytics,” for which respondents were given these definitions: “Complex analytics” means math functions like covariance, clustering, machine learning, principal components analysis and graph operations. “Basic analytics” means business intelligence reporting such as sums, counts and aggregates. This distinction is important because basic analytics are “embarrassingly parallel” whereas complex analytics are not. Here’s what we mean. “Embarrassingly parallel” (sometimes referred to as “data parallel”) refers to problems that can be separated into multiple independent sub-problems that can run in parallel and do not require access to all the data at once. This is the divide-and-conquer approach used by MapReduce/Hadoop. In contrast, “non-embarrassingly parallel” problems require using and sharing all the data at once and communicating intermediate results among processes. Matrix multiplication on matrices too large to fit on one server is an example of a non-embarrassingly parallel function.
  • 3. The Big Takeaways: (1) We’ve all heard how hard it is to analyze massive and rapidly growing data volumes. But data scientists say variety presents a bigger challenge. They are at times leaving data out of their analyses as they wrestle with how to integrate and analyze more types of data such as time-stamped sensor, location, image and behavioral data as well as network data. (2) Data scientists are turning to large-scale complex analytics both for unbiased data-driven exploration and to wrest more value from their data. (3) For complex analytics, data scientists are forced to move large volumes of data from existing data stores to dedicated mathematical and statistical computing software. This time-consuming and coding-intensive step adds no analytical value and impedes productivity. (4) While Hadoop has garnered widespread media coverage, 76 percent of data scientists have encountered serious limitations using it. Hadoop is well suited for embarrassingly parallel problems but falls short for large-scale complex analytics. (5) Incorporating the diverse data types into analytical workflows is a major pain point for data scientists using traditional relational database software. (6) For data scientists, Big Data means Big Stress. 39 percent say it’s made their job more stressful.
  • 4. Data Variety Is Proving to Be More Important Than Volume. The overwhelming volume of corporate and organizational data continues to generate headlines, but it’s the diverse types of data that pose a bigger challenge. Nearly three-quarters of data scientists (71 percent) said Big Data had made their analytics more difficult and that data variety, not just volume, was the challenge. Survey statement: “My analytics are becoming more difficult because of the variety and types of data sources (not just the volume)”: TRUE 71%, FALSE 29%. What is the biggest problem you face in gaining insights from your Big Data? I struggle with managing new types and sources of data (40%); I know how to get the answer but it takes too long because my data is too big to move to a math/analytics software package (36%); I don’t know what questions to ask of my data (24%); I know what I want to ask but don’t know how to get the answers (18%); I know how to get the answer but my analysis runs out of memory (17%). Which types of data do you anticipate using in the next year? Time-series (66%); business transaction (66%); geospatial/location (55%); graph (network) (46%); clickstream (35%); health records (25%); sensor (17%); image (13%); genomic (7%).
  • 5. What It Means: The ability to effectively use diverse data sources is proving to be a competitive differentiator in many industries. The trend toward hyper-personalization and precision targeting illustrates this well. Recommendations, search results and ads are becoming ever more relevant and micro-targeted as they tap more and diverse data like social networks, current location, and browsing and purchasing history. Personalized insurance offerings are augmenting sensor data about driver behavior to incorporate contextual data like time of day and road congestion. Precision medicine providers are gaining a more refined understanding of what works for whom by integrating molecular data with clinical, behavioral, electronic health records and environmental data. But the ability to use diverse data types poses a serious challenge. (For more on this topic, see “Big Data at Work: Dispelling the Myths, Uncovering the Opportunities,” by Thomas Davenport, Chapter 1: “Why Big Data is Important to you and your Organization.”)
  • 6. Data Scientists Are Turning to Complex Analytics to Analyze Their Big Data. When will your company begin to use complex analytics on your Big Data? We use it now (59%); we plan to use it in the next year (15%); in the next 2 years (16%); in the next 3 years (1%); more than 3 years down the road (4%); no plans to use complex analytics (4%). “The point is not to be dazzled by the volume of data, but rather to analyze it, to convert it into insights, innovations, and business value.” (Thomas Davenport, “Big Data at Work: Dispelling the Myths, Uncovering the Opportunities,” page 2.)
  • 7. What It Means: The “low-hanging fruit” of Big Data has been exploited. Many new analytical uses require significantly more powerful algorithms and computational approaches than what’s possible in Hadoop or relational databases. Data scientists increasingly need to leverage all data sources in novel ways, using tools and analytical infrastructures suitable for the task. As we have already seen in this survey, organizations are moving from simple SQL aggregates and summary statistics to next-generation analytics such as machine learning, clustering, correlation, and principal components analysis on moderately sized data sets. The move from simple to complex analytics on Big Data presages an emerging need for analytics that scale beyond single-server memory limits and handle sparsity, missing values and mixed sampling frequencies appropriately. These complex analytics methods can also provide data scientists with unsupervised and assumption-free approaches, letting all the data speak for itself.
  • 8. Moving Big Data Poses Difficult Challenges to Data Scientists. Data scientists face another growing challenge: conventional analytic workflows require them to move data to mathematical and statistical computing software. This workflow made sense with small or sampled data but is either woefully inefficient or breaks with even moderately large data volumes. 78% of data scientists utilize software capable of complex analytics in addition to their data management software. 36% of data scientists say it takes too long to get insights from their data because it is too big to move to their analytics software.
  • 9. What It Means: The size and diversity of today’s data sets pose a significant hurdle to doing more sophisticated analytics because so much time is lost moving data from files or from a database to analysis tools. This forces data scientists to make compromises, analyzing samples instead of the whole data set, leaving data and money on the table. Data scientists risk missing rare events, weak signals or important anomalies when restricted to working with samples or computing on subsets independently. (For more on this topic, see “Scaling Big Data Mining Infrastructure: The Twitter Experience,” by Twitter Engineering Manager Dmitriy Ryaboy and University of Maryland Associate Professor Jimmy Lin.) What’s needed are tools capable of conducting complex analytics over massive data volumes efficiently, without sampling and without moving the data.
  • 10. Hadoop Only Takes You So Far. While the Hadoop software platform garners significant media attention, Hadoop is not a viable solution for many use cases, especially those that require complex analytics. Fewer than half of data scientists surveyed (48 percent) have used Hadoop or Spark, and of those, 76 percent cited significant limitations to its use. From the 76% reporting problems, what are the limitations of Hadoop/Spark? It takes too much effort to program (39%); it’s too slow for interactive, ad hoc queries (37%); it’s too slow for real-time analytics (30%); it’s not well suited for my analytics, which are not embarrassingly parallel (22%). 35% of data scientists who tried Hadoop or Spark have stopped using it.
  • 11. What It Means: Hadoop was unrealistically hyped as a universal and disruptive Big Data solution. But even Hadoop vendors have recognized the limitations. They are adding SQL functionality to their products to accommodate data scientists’ preference for a higher-level query language instead of programming languages like Java and to address the limitations of MapReduce. (E.g., Cloudera has abandoned MapReduce and is offering Impala to provide SQL on HDFS.) A growing number of complex analytics use cases are proving to be unworkable in Hadoop. First-wave Hadoop adopters like Google, Facebook and LinkedIn required a small army of developers to program and maintain Hadoop. But many organizations either don’t have the required staff or face complex analytics challenges that can’t be readily solved with Hadoop. This presents a real challenge for the Hadoop infrastructure, which has to address these shortcomings or risk being replaced.
  • 12. Existing relational database management systems are inadequate for analyzing the variety of data sources. Given the growing diversification of data types and sources coupled with the limitations of existing relational databases, it’s no surprise that many data scientists are frustrated leveraging these data sources in their analytical workflows. Survey statement: “I am finding it harder to fit my data into relational database tables”: TRUE 49%, FALSE 51%.
  • 13. What It Means: Relational databases were built for storing and querying densely populated transactional data such as business purchases and customer information. By comparison, temporal, spatial and network data may be quite sparse (containing many missing values), have mixed sampling frequencies and have a natural order. Relational databases require predefined access patterns for each line of inquiry, an obvious non-starter for data scientists doing ad hoc data exploration.
  • 14. Big Data Means Big Stress for Data Scientists. There’s another side of the Big Data story: 39 percent of data scientists say their job has become more stressful with the growth of Big Data. That’s nearly four times the number who say it’s made their job less stressful. 39% of data scientists say the growth of Big Data has made their job more stressful in the last year. 24% say they don’t know which questions to ask of their Big Data. Quotes from data scientists: “My biggest problem is linking various data sources.” “The data is just too big.” “The biggest problem is putting multiple sources of data together.”
  • 15. What It Means: Driven in part by media hype, organizations have developed inflated expectations around the value they’ll get out of Big Data. Fulfilling those expectations falls on the data scientist. But outdated software approaches better suited to traditional transactional data than to today’s diverse data sources and rapidly growing volumes often make it impossible to fulfill these expectations. It’s a recipe for stress. Deriving business value from organizational data starts with ad hoc analysis. Tools and workflows need to enable data scientists to conduct analysis quickly and efficiently, making data scientists more productive and lowering stress levels as a result.
  • 16. So What? Data scientists play a pivotal role helping organizations unlock the potential of their Big Data. But current software tools fall short in some areas, as indicated in the survey. Hype has exceeded reality and data scientists are forced to compromise, sometimes leaving data on the table. Choosing the right software solution is key, but don’t expect to get there by browsing vendors’ websites. The fact that so many data scientists identified shortcomings in their infrastructure suggests that the only way to tell which solution is best suited to your organization is to do a pilot project using your data and your use cases. About the Survey: The Paradigm4 Data Scientist Survey was fielded by Innovation Enterprise, an independent research firm, from March 27 to April 23, 2014. The responses were generated from a survey of 111 data scientists in the U.S. About Paradigm4: Paradigm4 is the creator of SciDB, a computational database management system used to solve large-scale, complex analytics challenges on Big (and Diverse) Data. Led by industry visionaries and veterans Michael Stonebraker, Marilyn Matz, Paul Brown and Bryan Lewis, Paradigm4 enables data-obsessed organizations in life sciences, e-commerce, finance, and manufacturing to answer harder questions faster. For more information, visit www.paradigm4.com
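
The distinction slide 2 draws between embarrassingly parallel and non-embarrassingly parallel problems can be illustrated with a small Python sketch. It is not taken from the report: the function names, the row-partitioned layout and the simple transfer count are all assumptions made for illustration, and plain lists stand in for data spread across servers. A sum needs only one partial result per partition, while a matrix product forces each partition of one matrix to see every partition of the other.

```python
def matmul(a, b):
    """Plain triple-loop matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def row_partitions(m, n_parts):
    """Split a matrix into contiguous row blocks, one per hypothetical worker."""
    size = -(-len(m) // n_parts)  # ceiling division
    return [m[i:i + size] for i in range(0, len(m), size)]


def partitioned_sum(m, n_parts):
    """Basic analytics: each worker sums only its own rows, and the
    partial results are combined once at the end."""
    partials = [sum(sum(row) for row in part)
                for part in row_partitions(m, n_parts)]
    return sum(partials)


def partitioned_matmul(a, b, n_parts):
    """Complex analytics: the worker holding row block p of A still needs
    every row block of B, so blocks of B must be exchanged between workers.
    Returns the product and the number of block transfers (illustrative)."""
    a_parts = row_partitions(a, n_parts)
    b_parts = row_partitions(b, n_parts)
    result, transfers = [], 0
    for p, a_part in enumerate(a_parts):
        gathered_b = []
        for q, b_part in enumerate(b_parts):
            if q != p:
                transfers += 1  # block q of B lives on another worker
            gathered_b.extend(b_part)
        result.extend(matmul(a_part, gathered_b))
    return result, transfers


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(partitioned_sum(a, 2))
    print(partitioned_matmul(a, b, 2))
```

In this toy model the sum exchanges nothing but one number per partition, whereas the product's transfer count grows with the square of the number of partitions, which is the kind of communication cost the report attributes to complex analytics.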