CSI Communications | April 2013 | 7
Big Data Systems: Past, Present & (possibly) Future
Dr. Milind Bhandarkar
Chief Scientist, Machine Learning Platforms,
Greenplum, a division of EMC²
Cover Story
The data management industry has
matured over the last three decades,
primarily based on Relational Data
Base Management Systems (RDBMS)
technology. Even today, RDBMS systems
power a majority of backend systems for
online digital media, financial systems,
insurance, healthcare, transportation,
and telecommunications companies.
Since the amount of data collected and
analyzed in enterprises has increased
several-fold in volume, variety, and
velocity of generation and consumption,
organizations have started struggling
with the architectural limitations of
traditional RDBMS. As a result, a new
class of systems had to be designed and
implemented, giving rise to the new
phenomenon of “Big Data”.
In this article, we will trace the origin
and history of this new class of systems
to handle “Big Data”. We refer to current
popular big data systems, exemplified by
Hadoop, and discuss some current and
future use-cases of these systems.
What is Big Data?
While there is no universally accepted
definition of Big Data yet, and most of
the attention in the press is devoted to
the “Bigness” of Big Data, volume of data
is only one factor in the requirements
of modern data processing platforms.
Industry analyst firm Gartner[1] defines Big
Data as:
Big data is high-volume, high-velocity,
and high-variety information assets that
demand cost-effective, innovative forms
of information processing for enhanced
insight and decision-making.
A recent IDC study, sponsored
by EMC²[2], predicts that the “digital
universe”, the data that is generated in
digital form by humankind, will double
every two years, and will reach 40,000
exabytes (40 × 10^21 bytes) by 2020. A
major driving factor behind this data
growth is ubiquitous connectivity via
rapidly growing reach of mobile devices,
constantly connected to the networks.
What is even more remarkable is that only
a small portion of this digital universe
is “visible”, namely the data (videos,
pictures, documents, status updates,
tweets) created and consumed by
consumers. A vast amount of data will be
created not “by” human users, but “about”
humans by the digital universe, and it
will be stored, managed, and analyzed by
enterprises such as Internet service
providers and cloud service providers of
all varieties (Infrastructure-as-a-Service,
Platform-as-a-Service, and
Software-as-a-Service).
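The doubling projection above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, assuming a steady doubling every two years; the function name is illustrative, and the 2012 figure it prints is only what the projection implies when run backwards, not a number taken from the IDC study.

```python
EXABYTE = 10**18  # bytes in one exabyte (decimal definition)

def projected_size_eb(size_eb, from_year, to_year, doubling_years=2):
    """Size in exabytes at to_year, given size_eb at from_year
    and an assumed steady doubling every doubling_years."""
    return size_eb * 2 ** ((to_year - from_year) / doubling_years)

size_2020_eb = 40_000

# 40,000 exabytes expressed in bytes equals 40 * 10^21 bytes.
print(size_2020_eb * EXABYTE == 40 * 10**21)  # True

# Running the projection backwards: four doublings separate 2012 and 2020.
print(projected_size_eb(size_2020_eb, 2020, 2012))  # 2500.0
```

In other words, eight years of doubling every two years is a sixteen-fold increase.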
Origins of Big Data Infrastructure
We already notice this rapid growth
of data generation in the online world
around us. Facebook has grown from one
Million users in 2004, to more than one
Billion in 2012, a thousand-fold increase in
less than eight years. More than 60% of
these users access Facebook from mobile
phones today. The value generated by
a social network is proportional to the
number of contacts between users of the
social network, rather than the number of
users. According to a widely cited variant of
Metcalfe’s Law[3], the number of contacts for
N users is proportional to N*logN. Thus,
the growth of contacts, and therefore
the interactions within a social network,
which results in data generation, is non-
linear with respect to number of users. As
the world gets more connected, one can
expect the number of interactions to grow,
resulting in even more accelerated data
growth.
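To make the non-linearity concrete, the sketch below compares how the user count and the N*logN contact estimate grow when a network goes from one million to one billion users, the Facebook figures quoted above. It is an illustration under assumptions: the helper names are hypothetical, the natural logarithm is used (the base cancels out in the ratio), and the quadratic curve is shown only for comparison, since Metcalfe's Law in its original form uses N squared.

```python
import math

def contacts_nlogn(n):
    """Contact estimate under the N*logN variant."""
    return n * math.log(n)

def contacts_quadratic(n):
    """Contact estimate under the original N^2 form, for comparison."""
    return n * n

def growth_factor(model, n_before, n_after):
    """How many times larger the model's estimate becomes."""
    return model(n_after) / model(n_before)

users_2004 = 1_000_000
users_2012 = 1_000_000_000

print(users_2012 / users_2004)  # 1000.0 (thousand-fold growth in users)
print(round(growth_factor(contacts_nlogn, users_2004, users_2012)))  # 1500
print(growth_factor(contacts_quadratic, users_2004, users_2012))  # 1000000.0
```

Under the N*logN estimate, a thousand-fold growth in users yields roughly a 1,500-fold growth in contacts, so data generation outpaces user growth even under the more conservative variant.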
Since the popularity of the Internet
was one of the main reasons for the growth
of communication and connectivity in
the world, Big Data platforms first
emerged in the Internet industry.
Google, founded in 1998 with the goal
of organizing all the information in the
world, became the dominant content
discovery platform on the World Wide
Web, trumping human-powered and
semi-automated approaches, such as web
portals and directories. The challenges
Google faced in crawling the web, and in
storing, indexing, ranking, and serving
billions of web pages, could not be solved
economically with the existing data
management systems. The amount of
publicly available content on the web in
Google’s search index exploded from
26 million pages in 1998 to more than
1 trillion in less than a decade[4].
In addition, this content was
“multi-structured”, consisting of
natural-language text, images, video,
geo-spatial data, and even renderings of
structured data. In order to rapidly answer
search queries, with information ranked by
relevance as well as timeliness, Google had
to develop its infrastructure from scratch.
In 2003 and 2004, Google published details
of a part of its infrastructure, in
particular the Google File System (GFS)[5]
and the MapReduce programming framework[6].
These two publications became the
blueprint for Apache Hadoop, an open
source framework that has become a
de facto standard for big data platforms
deployed today.
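For readers who have not seen the MapReduce programming model, the following is a minimal single-process sketch in Python of its three steps (map, group by key, reduce), using word counting as the example. It is not Google's or Hadoop's implementation or API: all function names are illustrative, everything runs in memory, and the distribution, fault tolerance, and file system integration that make the real frameworks useful are left out.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every (key, value) record and
    yield the intermediate (key, value) pairs it emits."""
    for key, value in records:
        yield from mapper(key, value)

def shuffle(pairs):
    """Group intermediate values by key, as the framework
    would do between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

def run_job(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)

# Word count: the mapper emits (word, 1); the reducer sums the ones.
def word_count_mapper(doc_id, text):
    for word in text.lower().split():
        yield word, 1

def word_count_reducer(word, counts):
    return sum(counts)

pages = [("page1", "big data systems"), ("page2", "big data platforms")]
counts = run_job(pages, word_count_mapper, word_count_reducer)
print(counts)  # {'big': 2, 'data': 2, 'systems': 1, 'platforms': 1}
```

The appeal of the model is that the programmer writes only the mapper and the reducer, while the framework decides where the records live and where each step runs.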
Apache Hadoop
The GFS and MapReduce papers motivated
Doug Cutting, creator of the open-source
search engine library Apache Lucene, to
re-architect Nutch, the web crawling and
content system built on Lucene, to
incorporate a distributed file system and
the MapReduce programming framework for
the tasks of crawling, storing, ranking,
and indexing web pages so that they could
be served as search results
by Lucene. These developments were
noticed by engineers and executives at
Yahoo, which was then struggling to scale
its search backend infrastructure. Yahoo
adopted Apache Hadoop in January
2006, and made significant contributions
to make it a scalable and stable platform.
Today, Yahoo has the largest footprint
of Apache Hadoop, running more than
45,000 servers that manage more than 370
petabytes of data with Hadoop[7]. Because
Hadoop is an open source system, licensed
under the liberal Apache License and
governed by the Apache Software
Foundation, it can be freely downloaded,
deployed, modified, and used in any
organization without any hard requirement
to contribute the changes back to
open source. The scalability and flexibility
of Apache Hadoop prompted growing
Internet companies such as Facebook,
Twitter, and LinkedIn to adopt it for
their data infrastructure, and contribute
modifications and usability enhancements
back to the Apache Hadoop community.
As a result, the Hadoop ecosystem grew
rapidly over the years.
Today, there are more than 20
components in the Hadoop ecosystem
that are developed as open source projects
under the Apache Software Foundation,
and several hundred proprietary and
other open source components. Some
of the popular components in the
Hadoop ecosystem, apart from Hadoop
Distributed File System (HDFS), and
MapReduce, include Hive, a SQL-like
language that translates to MapReduce;
Pig, an imperative data flow language that
generates MapReduce jobs to execute the
data flow; and HBase, a NoSQL key-value
store that uses HDFS as its persistence
layer. HBase is based on a paper describing
another Google infrastructure component,
Bigtable, which was published in 2006[8].
While Hadoop has today become
the de facto platform for analyzing Big Data,
challenges remain in making it accessible
and improving its ease of use, and thus in
making it a first-class citizen of the data
infrastructure managed by IT professionals.
The MapReduce programming paradigm is not
particularly easy for data analysts to use,
and commonly used business intelligence
tools do not interoperate with the
interfaces provided by Hadoop today. To
overcome these challenges, a number of data
warehousing system vendors, such as
Teradata, Oracle, IBM, EMC²/Greenplum,
and others, offer connectivity with Hadoop
platforms. There are even efforts towards
unifying SQL-based OLAP platforms, such
as Greenplum, with Hadoop[9]. A number
of Hadoop distributions have emerged
over the years, improving the manageability
of Hadoop infrastructure. These include
Cloudera, Hortonworks, MapR,
EMC²/Greenplum, IBM BigInsights, and
Microsoft HDInsight, among others. In
addition, there is an increasing number of
Big Data appliances, hardware platforms
that are integrated with Hadoop
distributions, from vendors including
Oracle, Teradata, and EMC²/Greenplum.
Hadoop Adoption & Use Cases
Over the years, Hadoop and other big
data technologies have become popular
in non-Internet organizations as well,
which are also struggling to handle the
data deluge. Infrastructure in many
organizations in various industries, such
as retail, insurance, healthcare, finance,
manufacturing, and others, has been almost
fully digitized. Until recently, the data
these organizations collected was stored in
archival systems, mostly for regulatory
compliance purposes. However, there is a
growing realization across these
organizations that this data can be
utilized for gaining competitive advantage,
increasing process efficiencies, and
improving customer experience. In a recent
study conducted by Tata Consultancy
Services (TCS)[10], over 50% of the
organizations surveyed are using Big Data
technologies, and many of them predicted
more than 25% gains in returns on
investment (ROI), mostly from increased
revenue. The flexibility of these Big Data
systems to combine structured datasets
(51%) with semi-structured datasets (49%)
has been cited as enabling advanced
analytics capabilities. In addition, while
most of the organizations use data that is
available internally (70%), the
availability of external data, such as from
Twitter and other social media, allows
them to perform better customer behavior
analysis.
The 3Vs (volume, velocity, and
variety of data), along with the need to
develop agile, data-driven applications,
imply that the humans analyzing data,
detecting patterns, and making sense of it
need to have a rich toolset at hand.
Traditional data exploration,
visualization, business intelligence, and
reporting tools are being adapted to
co-exist with these new Big Data
technologies. Advances in machine
learning algorithms and methods, as
well as abundant processing power,
have democratized deep and predictive
analytics, putting them within reach of
any average IT department. Open source
languages for statistical analysis and
modeling, such as the popular R
language[11] and newcomers such as Julia,
as well as emerging machine learning
frameworks, such as scikit-learn in
Python[12], Apache Mahout for Hadoop[13],
and the in-database deep analytics library
MADlib[14], have attracted the attention of
developers and users for developing
machine-learning powered applications
based on large and diverse datasets.
These new platforms, languages,
and frameworks have challenged several
predominant practices in enterprises.
Traditional data governance practices,
including access control, provenance,
retention, backup, mirroring, disaster
recovery, security, and privacy, are
struggling to cope with organizations’
ability to store and process massive
amounts of diverse data. Over the next few
years, one should expect best practices
for data governance, and associated
technologies, to emerge and become
commonplace.
Industrial Internet: The Next Frontier
While most of the Big Data use cases
today involve analyzing customer behavior
(buying patterns, likes and dislikes as
expressed in social media, clickstreams,
and location information from mobile
devices), machine-generated data could be
the next frontier for Big Data systems. In
addition, cheap sensor technology and
short-range wireless connectivity have
created the possibility of real-time
monitoring, and historical pattern
analysis, of traditionally analog
information sources. For example, a
modern Ford automobile has thousands of
signals captured by 70+ sensors, which
generate more than 25 gigabytes of data
every hour and are processed by 70
on-board computers[15]. While most of this
data is transient, and needs to be acted
upon in real time, recognizing patterns
within the data to improve the safety and
usability of the automobile implies
aggregating and analyzing it offline.
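A rough calculation shows why offline aggregation of such data quickly becomes a Big Data problem. The sketch below is in Python and uses the 25 gigabytes per hour quoted above as a lower bound; the fleet size and daily driving time are hypothetical values chosen only for illustration, and the function names are likewise illustrative.

```python
GB = 10**9  # bytes in one gigabyte (decimal definition)

def bytes_per_second(gb_per_hour):
    """Sustained data rate of one vehicle, in bytes per second."""
    return gb_per_hour * GB / 3600

def fleet_tb_per_day(gb_per_hour, vehicles, driving_hours_per_day):
    """Total data generated by a fleet in one day, in terabytes."""
    return gb_per_hour * vehicles * driving_hours_per_day / 1000

rate = bytes_per_second(25)
print(f"{rate / 10**6:.1f} MB/s per vehicle")  # 6.9 MB/s per vehicle

# Hypothetical fleet: 1,000 vehicles, each driven one hour per day.
print(fleet_tb_per_day(25, vehicles=1000, driving_hours_per_day=1))  # 25.0
```

Even this small hypothetical fleet would produce tens of terabytes per day if every signal were retained, which is why only a selected or summarized subset is likely to be kept for offline analysis.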
Indeed, the massive amount of data
captured by sensors in machinery, and the
possibility of storing and analyzing this
data to make intelligent design and
operational decisions, has created a new
opportunity, now known by a new moniker,
the Industrial Internet[16]. If, as a
result of analyzing this data to aid better
decision making, we could reduce system
inefficiencies in the healthcare industry
by a mere 1%, it could result in savings of
USD 63 billion over the next 15 years. If
advanced analytics capabilities on the
large amount of oil and gas exploration
data result in a reduction of only 1% in
capital expenditure, it could save more than
USD 90 billion over the next 15 years. The
key element proposed for the Industrial
Internet is Intelligent Connected Machines
with advanced sensors for data capture,
controls for automation, and software
applications powered by deep physics-based
analytics and predictive algorithms for
analyzing large amounts of sensor and
telemetry data.
Indeed, following the industrial
revolution and the Internet revolution,
we are witnessing a third revolution,
that of the Industrial Internet, powered by
Big Data.
References
[1] IT Glossary, Gartner Inc, http://www.gartner.com/it-glossary/big-data/
[2] The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East, http://www.emc.com/leadership/digital-universe/iview/index.htm, December 2012.
[3] Metcalfe’s Law Recurses Down the Long Tail of Social Networks, http://vcmike.wordpress.com/2006/08/18/metcalfe-social-networks/, April 2006.
[4] We knew the web was big, http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html, July 2008.
[5] The Google File System, http://research.google.com/archive/gfs.html, October 2003.
[6] MapReduce: Simplified Data Processing on Large Clusters, http://research.google.com/archive/mapreduce.html, December 2004.
[7] The History of Hadoop: From 4 Nodes to the Future of Data, http://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/, March 2013.
[8] Bigtable: A Distributed Storage System for Structured Data, http://research.google.com/archive/bigtable.html, November 2006.
[9] HAWQ: The New Benchmark for SQL on Hadoop, http://www.greenplum.com/blog/dive-in/hawq-the-new-benchmark-for-sql-on-hadoop, February 2013.
[10] The Emerging Returns on Big Data, http://www.tcs.com/big-data-study/Pages/default.aspx, March 2013.
[11] The R Project for Statistical Computing, http://www.r-project.org/.
[12] Scikit-learn: machine learning in Python, http://scikit-learn.org/stable/.
[13] Apache Mahout, http://mahout.apache.org/.
[14] MADlib, http://madlib.net/.
[15] Sensing the Future: Ford issues Predictions for the next wave of Automotive Electronics Innovation, http://media.ford.com/article_display.cfm?article_id=37541, December 2012.
[16] Industrial Internet: Pushing the Boundaries of Minds and Machines, http://www.ge.com/docs/chapters/Industrial_Internet.pdf, November 2012.
About the Author
Dr. Milind Bhandarkar was a founding member of the team at Yahoo that took Apache Hadoop from a 20-node
prototype to a datacenter-scale production system, and has been contributing to and working with Hadoop since version
0.1. He started the Yahoo Grid solutions team focused on training, consulting, and supporting hundreds of new
migrants to Hadoop. Parallel programming languages and paradigms have been his area of focus for over 20 years,
and were the topic of his PhD dissertation at the University of Illinois at Urbana-Champaign. He has worked at the Centre for
Development of Advanced Computing (C-DAC), the National Center for Supercomputing Applications (NCSA), the Center
for Simulation of Advanced Rockets, Siebel Systems, Pathscale Inc. (acquired by QLogic), Yahoo, and LinkedIn.
Currently, he is the Chief Scientist at Greenplum, a division of EMC².

EMC Academic Summit 2015EMC
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesEMC
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsEMC
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookEMC
 

Mehr von EMC (20)

INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDINDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
 
Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote
 
EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX
 
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOTransforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
 
Citrix ready-webinar-xtremio
Citrix ready-webinar-xtremioCitrix ready-webinar-xtremio
Citrix ready-webinar-xtremio
 
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
 
EMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC with Mirantis Openstack
EMC with Mirantis Openstack
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lake
 
Force Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereForce Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop Elsewhere
 
Pivotal : Moments in Container History
Pivotal : Moments in Container History Pivotal : Moments in Container History
Pivotal : Moments in Container History
 
Data Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewData Lake Protection - A Technical Review
Data Lake Protection - A Technical Review
 
Mobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeMobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or Foe
 
Virtualization Myths Infographic
Virtualization Myths Infographic Virtualization Myths Infographic
Virtualization Myths Infographic
 
Intelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityIntelligence-Driven GRC for Security
Intelligence-Driven GRC for Security
 
The Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeThe Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure Age
 
EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015
 
EMC Academic Summit 2015
EMC Academic Summit 2015EMC Academic Summit 2015
EMC Academic Summit 2015
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education Services
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere Environments
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBook
 

Kürzlich hochgeladen

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 

Kürzlich hochgeladen (20)

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 

Big Data Systems: Past, Present & (Possibly) Future with @techmilind

  • 1. CSI Communications | April 2013 | 7

Big Data Systems: Past, Present & (possibly) Future
Dr. Milind Bhandarkar, Chief Scientist, Machine Learning Platforms, Greenplum, A Division of EMC2
Cover Story

The data management industry has matured over the last three decades, primarily based on Relational Database Management Systems (RDBMS) technology. Even today, RDBMS systems power a majority of backend systems for online digital media, financial systems, insurance, healthcare, transportation, and telecommunications companies. As the amount of data collected and analyzed in enterprises has increased severalfold in volume, variety, and velocity of generation and consumption, organizations have started struggling with the architectural limitations of traditional RDBMS architectures. As a result, a new class of systems had to be designed and implemented, giving rise to the new phenomenon of “Big Data”.

In this article, we will trace the origin and history of this new class of systems to handle “Big Data”. We refer to current popular big data systems, exemplified by Hadoop, and discuss some current and future use-cases of these systems.

What is Big Data?

While there is no universally accepted definition of Big Data yet, and most of the attention in the press is devoted to the “Bigness” of Big Data, volume of data is only one factor in the requirements of modern data processing platforms. Industry analyst firm Gartner[1] defines Big Data as:

Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making.

A recent IDC study, sponsored by EMC2[2], predicts that the “digital universe”, the data that is generated in digital form by humankind, will double every two years, and will reach 40,000 exabytes (40 × 10^21 bytes) by 2020.
A major driving factor behind this data growth is ubiquitous connectivity via the rapidly growing reach of mobile devices, constantly connected to the networks. What is even more remarkable is that only a small portion of this digital universe is “visible”: the data (videos, pictures, documents, status updates, tweets) created and consumed by consumers. A vast amount of data will be created not “by” human users, but “about” humans by the digital universe, and it will be stored, managed, and analyzed by enterprises such as Internet service providers and cloud service providers of all varieties (Infrastructure-as-a-Service, Platform-as-a-Service, and Software-as-a-Service).

Origins of Big Data Infrastructure

We already notice this rapid growth of data generation in the online world around us. Facebook has grown from one million users in 2004 to more than one billion in 2012, a thousand-fold increase in less than eight years. More than 60% of these users access Facebook from mobile phones today. The value generated by a social network is proportional to the number of contacts between users of the social network, rather than the number of users. According to Metcalfe’s Law[3], and its variants, the number of contacts for N users is proportional to N*logN. Thus, the growth of contacts, and therefore of the interactions within a social network that result in data generation, is non-linear with respect to the number of users. As the world gets more connected, one can expect the number of interactions to grow, resulting in even more accelerated data growth. Since the popularity of the Internet was one of the main reasons for the growth of communication and connectivity in the world, we saw the emergence of Big Data platforms in the Internet industry.
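The non-linear growth implied by the N*logN relation can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, not part of the original article; the function name `relative_contacts` is hypothetical, and the unknown constant of proportionality is deliberately left out because it cancels when two network sizes are compared.

```python
import math


def relative_contacts(n_users: int) -> float:
    """Contacts for n_users under the N*log(N) variant, up to an unknown constant factor."""
    return n_users * math.log(n_users)


# User counts quoted in the text: one million (2004) and one billion (2012).
users_2004 = 1_000_000
users_2012 = 1_000_000_000

user_growth = users_2012 / users_2004
contact_growth = relative_contacts(users_2012) / relative_contacts(users_2004)

print(f"users grew {user_growth:.0f}x, contacts grew about {contact_growth:.0f}x")
```

Under this model, a thousand-fold increase in users corresponds to roughly a 1,500-fold increase in contacts, since the logarithmic factor grows from log(10^6) to log(10^9), a ratio of 1.5 regardless of the logarithm base.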
Google, founded in 1998 with the goal of organizing all the information in the world, became the dominant content discovery platform on the World Wide Web, trumping human-powered and semi-automated approaches, such as web portals and directories. The challenges Google faced in crawling the web, and in storing, indexing, ranking, and serving billions of web pages, could not be solved economically with the existing data management systems. The amount of publicly available content on the web in Google’s search index exploded from 26 million pages in 1998 to more than 1 trillion in less than a decade[4]. In addition, this content was “multi-structured”, consisting of natural-language text, images, video, geo-spatial data, and even renderings of structured data. In order to rapidly answer search queries, with information ranked by relevance as well as timeliness, Google had to develop its infrastructure from scratch. In 2003 and 2004, Google published details of a part of its infrastructure, in particular the Google File System (GFS)[5] and the MapReduce programming framework[6]. These two publications became the blueprint for Apache Hadoop, an open source framework that has become a de facto standard for big data platforms deployed today.

Apache Hadoop

The GFS and MapReduce papers motivated Doug Cutting, creator of an open-source search engine, Apache Lucene, to re-architect the content system of Lucene, called Nutch, to incorporate a distributed file system and the MapReduce programming framework for the tasks of crawling, storing, ranking, and indexing web pages so that they could be served as search results by Lucene. These developments were noticed by engineers and executives at Yahoo, which was then struggling to scale its search backend infrastructure. Yahoo adopted Apache Hadoop in January 2006, and made significant contributions to make it a scalable and stable platform.
Today, Yahoo has the largest footprint of Apache Hadoop, running more than 45,000 servers managing more than 370 Petabytes of data with Hadoop[7]. Being an open source system, licensed under the liberal Apache Software License and governed by the Apache Software Foundation, meant that Hadoop could be freely downloaded and deployed in any organization, modified and used without any hard requirement of having to contribute the changes back to open source. The scalability and flexibility of Apache Hadoop prompted growing Internet companies such as Facebook, Twitter, and LinkedIn to adopt it for their data infrastructure, and contribute
  • 2. modifications and usability enhancements back to the Apache Hadoop community. As a result, the Hadoop ecosystem grew rapidly over the years. Today, there are more than 20 components in the Hadoop ecosystem that are developed as open source projects under the Apache Software Foundation, and several hundred proprietary and other open source components. Some of the popular components in the Hadoop ecosystem, apart from the Hadoop Distributed File System (HDFS) and MapReduce, include Hive, a SQL-like language that translates to MapReduce; Pig, an imperative data flow language that generates MapReduce jobs to execute the data flow; and HBase, a NoSQL key-value store that uses HDFS as its persistence layer. HBase is based on a paper describing another Google infrastructure component, Bigtable, which was published in 2006[8].

While Hadoop, today, has become the de facto platform for analyzing Big Data, challenges remain in making it accessible and improving its ease of use, thus making it a first-class citizen of data infrastructure managed by IT professionals. The MapReduce programming paradigm is not particularly easy to use for data analysts, and commonly used business intelligence tools do not interoperate with the interfaces provided by Hadoop today. To overcome these challenges, a number of data warehousing system vendors, such as Teradata, Oracle, IBM, EMC2/Greenplum, and others, offer connectivity with Hadoop platforms. There are even efforts towards unifying SQL-based OLAP platforms, such as Greenplum, with Hadoop[9]. A number of Hadoop distributions have emerged over the years, improving the manageability of Hadoop infrastructure. These include Cloudera, Hortonworks, MapR, EMC2/Greenplum, IBM BigInsights, Microsoft HDInsight, etc. In addition, there is an increasing number of Big Data Appliances, hardware platforms that are integrated with Hadoop distributions, including those from Oracle, Teradata, and EMC2/Greenplum.
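To make the MapReduce programming paradigm discussed above more concrete, here is a minimal single-process sketch in Python of the classic word-count job. It is an illustration only and does not use Hadoop's actual API: the names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical, and the in-memory grouping step merely stands in for the distributed shuffle that a real cluster performs between the map and reduce phases.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_words(document: str) -> Iterator[tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> tuple[str, int]:
    """Reduce phase: sum all the values emitted for one key."""
    return word, sum(counts)


def run_mapreduce(
    documents: Iterable[str],
    mapper: Callable[[str], Iterator[tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], tuple[str, int]],
) -> dict[str, int]:
    """Run mapper over every document, group values by key, then reduce each group."""
    groups: dict[str, list[int]] = defaultdict(list)
    for document in documents:
        for key, value in mapper(document):
            groups[key].append(value)  # stand-in for the distributed shuffle
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Hypothetical input: three tiny "web pages".
pages = ["big data systems", "big data platforms", "data"]
word_counts = run_mapreduce(pages, map_words, reduce_counts)
print(word_counts)
```

Even in this toy form, the analyst has to express a simple aggregation as two separate functions plus a grouping step, which hints at why higher-level languages such as Hive and Pig were layered on top of MapReduce.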
Hadoop Adoption & Use Cases

Over the years, Hadoop and other big data technologies have also become popular in non-Internet organizations that are struggling to handle the data deluge. Infrastructure in many organizations in various industries, such as retail, insurance, healthcare, finance, manufacturing, and others, has been almost fully digitized. Until recently, the data these organizations collected were stored in archival systems, mostly for regulatory compliance purposes. However, there is a growing realization across these organizations that this data can be utilized for gaining competitive advantage, increasing process efficiencies, and improving customer experience. In a recent study conducted by Tata Consultancy Services (TCS)[10], over 50% of organizations surveyed are using Big Data technologies, and many of them predicted more than 25% gains in return on investment (ROI), mostly from increased revenue. The flexibility of these Big Data systems to combine structured datasets (51%) with semi-structured datasets (49%) has been cited as enabling advanced analytics capabilities. In addition, while most of the organizations use data that is available internally (70%), the availability of external data, such as from Twitter and other social media, allows them to perform better customer behavior analysis.

The 3V’s (volume, velocity, and variety of data), along with the need to develop agile, data-driven applications, imply that the humans analyzing, detecting patterns in, and making sense of data need to have a rich toolset at hand. Traditional data exploration, visualization, business intelligence, and reporting tools are being adapted to co-exist with these new Big Data technologies. Advances in machine learning algorithms and methods, as well as abundant processing power, have democratized deep and predictive analytics so that they can be used in any average IT department.
Open source languages for statistical analysis and modeling, such as the popular R language[11] and a newcomer, Julia, as well as emerging machine learning frameworks, such as scikit-learn in Python[12], Apache Mahout for Hadoop[13], and the in-database deep analytics library MADlib[14], have attracted the attention of developers and users for developing machine-learning powered applications based on large and diverse datasets.

These new platforms, languages, and frameworks have challenged several predominant practices in enterprises. Traditional data governance practices, including access control, provenance, retention, backup, mirroring, disaster recovery, security, and privacy, are struggling to cope with organizations’ ability to store and process massive amounts of diverse data. Over the next few years, one should expect best practices for data governance, and associated technologies, to emerge and become commonplace.

Industrial Internet: The Next Frontier

While most of the Big Data use-cases today analyze customer behavior (buying patterns, likes and dislikes as expressed in social media, clickstreams, and location information from mobile devices), machine-generated data could be the next frontier for Big Data systems. In addition, cheap sensor technology and short-range wireless connectivity have created the possibility of real-time monitoring and historical pattern analysis of traditionally analog information sources. For example, a modern Ford automobile has thousands of signals captured by 70+ sensors that generate more than 25 gigabytes of data every hour, processed by 70 on-board computers[15]. While most of this data is transient and needs to be acted upon in real time, recognizing patterns within the data to improve the safety and usability of the automobile implies aggregating and analyzing it offline.
Indeed, the massive amount of data captured by sensors in machinery, and the possibility of storing and analyzing this data to make intelligent design and operational decisions, has created a new opportunity, now known by a new moniker, the Industrial Internet[16]. If, as a result of analyzing this data to aid better decision making, we could reduce system inefficiencies in the healthcare industry by a mere 1%, it could result in savings of USD 63 billion over the next 15 years. If advanced analytics capabilities on the large amount of oil and gas exploration data result in only a 1% reduction in capital expenditure, it could save more than USD 90 billion over the next 15 years. The key element proposed for the Industrial Internet is Intelligent Connected Machines with advanced sensors for data capture, controls for automation, and software applications powered by deep physics-based analytics and predictive algorithms for analyzing large amounts of sensor and telemetry data. Indeed, we are witnessing the third revolution, following the industrial revolution, and the Internet revolution,
  • 3. of the Industrial Internet, powered by Big Data.

References

[1] IT Glossary, Gartner Inc, http://www.gartner.com/it-glossary/big-data/
[2] The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East, http://www.emc.com/leadership/digital-universe/iview/index.htm, December 2012.
[3] Metcalfe’s Law Recurses Down the Long Tail of Social Networks, http://vcmike.wordpress.com/2006/08/18/metcalfe-social-networks/, April 2006.
[4] We knew the web was big, http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html, July 2008.
[5] The Google File System, http://research.google.com/archive/gfs.html, October 2003.
[6] MapReduce: Simplified Data Processing on Large Clusters, http://research.google.com/archive/mapreduce.html, December 2004.
[7] The History of Hadoop: From 4 Nodes to the Future of Data, http://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/, March 2013.
[8] Bigtable: A Distributed Storage System for Structured Data, http://research.google.com/archive/bigtable.html, November 2006.
[9] HAWQ: The New Benchmark for SQL on Hadoop, http://www.greenplum.com/blog/dive-in/hawq-the-new-benchmark-for-sql-on-hadoop, February 2013.
[10] The Emerging Returns on Big Data, http://www.tcs.com/big-data-study/Pages/default.aspx, March 2013.
[11] The R Project for Statistical Computing, http://www.r-project.org/.
[12] Scikit-learn: machine learning in Python, http://scikit-learn.org/stable/.
[13] Apache Mahout, http://mahout.apache.org/.
[14] MADlib, http://madlib.net/.
[15] Sensing the Future: Ford issues Predictions for the next wave of Automotive Electronics Innovation, http://media.ford.com/article_display.cfm?article_id=37541, December 2012.
[16] Industrial Internet: Pushing the Boundaries of Minds and Machines, http://www.ge.com/docs/chapters/Industrial_Internet.pdf, November 2012.

About the Author

Dr. Milind Bhandarkar was the founding member of the team at Yahoo that took Apache Hadoop from a 20-node prototype to a datacenter-scale production system, and has been contributing to and working with Hadoop since version 0.1. He started the Yahoo Grid solutions team focused on training, consulting, and supporting hundreds of new migrants to Hadoop. Parallel programming languages and paradigms have been his area of focus for over 20 years, and were the topic of his PhD dissertation at the University of Illinois at Urbana-Champaign. He worked at the Centre for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems, Pathscale Inc. (acquired by QLogic), Yahoo, and LinkedIn. Currently, he is the Chief Scientist at Greenplum, a division of EMC2.