big data - an overview
Arvind Kalyan
Engineer at LinkedIn
What makes data big?
Size?
The 100 GB 1980 U.S. Census database was
considered ‘big’ at the time and required
sophisticated machinery
http://www.columbia.edu/cu/computinghistory/mss.html
Size?
7 billion people on earth…
storing name (~20 chars), age (7 bits), gender (1
bit) for everyone on earth:
7 billion * (20 * 8 + 7 + 1) / 8 bytes = 147 GB
In the last 25 years that number hasn’t grown
much; it is still in the same order of magnitude.
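The estimate above can be reproduced with a quick sketch. The function name and the use of decimal gigabytes (10^9 bytes) are assumptions made for illustration; the field sizes are the ones on the slide.

```python
def census_bytes(people, name_chars=20, age_bits=7, gender_bits=1):
    """Bytes needed to store name, age and gender for `people` records."""
    bits_per_person = name_chars * 8 + age_bits + gender_bits
    return people * bits_per_person / 8


total = census_bytes(7_000_000_000)
print(f"{total / 1e9:.0f} GB")  # 147 GB (decimal gigabytes)
```

At 21 bytes per person, the whole table would fit in the memory of a single large server.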
Size?
i.e., the cardinality of real-world entities doesn't
grow very fast, so that really is not that
much data
Size?
Server with 128GB RAM and multiple TB of
disk space is not difficult to find in 2015
Size?
but the observations on those entities can be far
more numerous
Those 7 billion people interact with other
people, web-sites, products, etc. at different
points in time and at different locations, and the
number of data points quickly explodes
Analysis
‘big-data’ is a challenge when you want to analyze
all of those observations to identify trends &
patterns
RDBMS for performing
analysis
RDBMS these days can hold *much* larger
volumes on a single machine
RDBMS for performing
analysis
RDBMS is good at storing data & fetching
individual rows satisfying your query
RDBMS for performing
analysis
MySQL, for example, when used right, can
guarantee data durability and at the same time
provide some of the lowest read latencies
RDBMS for performing
analysis
RDBMS can also be used for ad-hoc analysis
performance of such analysis can be improved by
sacrificing ‘relational’ principles:
for example, by denormalizing tables (cost of
copy vs seek)
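A minimal sketch of that trade-off, using Python's built-in sqlite3 module. The table and column names (users, clicks, clicks_wide) are hypothetical and chosen only for illustration: the denormalized table copies the country onto every click row so the analysis query no longer needs a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized: user attributes live in one table, events in another.
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT)")
cur.execute("CREATE TABLE clicks (user_id INTEGER, page TEXT)")
cur.executemany("INSERT INTO users VALUES (?, ?)",
                [(1, "US"), (2, "IN"), (3, "US")])
cur.executemany("INSERT INTO clicks VALUES (?, ?)",
                [(1, "/home"), (1, "/jobs"), (2, "/home"), (3, "/home")])

# Denormalized: pay the cost of copying `country` onto every click row once.
cur.execute("""CREATE TABLE clicks_wide AS
               SELECT c.user_id, c.page, u.country
               FROM clicks c JOIN users u ON u.id = c.user_id""")

# Same question, answered with and without a join.
joined = cur.execute("""SELECT u.country, COUNT(*) FROM clicks c
                        JOIN users u ON u.id = c.user_id
                        GROUP BY u.country ORDER BY u.country""").fetchall()
wide = cur.execute("""SELECT country, COUNT(*) FROM clicks_wide
                      GROUP BY country ORDER BY country""").fetchall()
print(joined)  # [('IN', 1), ('US', 3)]
print(wide)    # [('IN', 1), ('US', 3)]
```

Both queries return the same counts; the second reads a single table at the price of redundant storage.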
RDBMS for performing
analysis
But this doesn’t scale when you want to run the
analysis across all rows for every user
some queries can run anywhere from seconds to
days depending on the size of
the tables
RDBMS for performing
analysis
.. and then consider the overhead of doing this on
an on-going basis
RDBMS for performing
analysis
RDBMS is not the right system if you want
to look at & process the full data-set
RDBMS for performing
analysis
These days data is an asset and businesses &
organizations need to extrapolate trends,
patterns, etc. from that data
What’s the solution?
.. if not RDBMS
“Numbers Everyone Should Know”
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns
Mutex lock/unlock 100 ns
Main memory reference 100 ns
Compress 1K bytes with Zippy 10,000 ns
Send 2K bytes over 1 Gbps network 20,000 ns
Read 1 MB sequentially from memory 250,000 ns
Round trip within same datacenter 500,000 ns
Disk seek 10,000,000 ns
Read 1 MB sequentially from network 10,000,000 ns
Read 1 MB sequentially from disk 30,000,000 ns
Send packet CA->Netherlands->CA 150,000,000 ns
Let’s look closely at disk vs memory
Source: http://deliveryimages.acm.org/10.1145/1570000/1563874/jacobs3.jpg
Let’s look closely at disk vs memory
in general:
disk is slower than SSD, and
SSD is slower than memory
but more importantly…
Let’s look closely at disk vs memory
Memory is not always faster than disk
SSD is not always faster than disk
.. and at network vs disk..
network is not always slower than disk
plus, a local machine and its disks have limits, but
the network enables you to grow!
Lesson learned
access pattern is important
depending on data-set (size) & use-case, the
access pattern alone can make a difference
between possible & not possible!
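A rough way to see this with the latency table above. The workload (one million 100-byte records, 100 MB in total) and the function names are hypothetical; the per-operation costs are the ones listed under “Numbers Everyone Should Know”. This is an estimate under those assumptions, not a benchmark.

```python
# Latencies in nanoseconds, taken from the table above.
DISK_SEEK_NS = 10_000_000
SEQ_READ_NS_PER_MB = {
    "memory": 250_000,
    "network": 10_000_000,
    "disk": 30_000_000,
}


def sequential_seconds(megabytes, medium):
    """Time to stream `megabytes` sequentially from the given medium."""
    return megabytes * SEQ_READ_NS_PER_MB[medium] / 1e9


def random_disk_seconds(num_reads):
    """Time if every record costs one disk seek."""
    return num_reads * DISK_SEEK_NS / 1e9


records = 1_000_000  # hypothetical workload
size_mb = 100        # 1,000,000 records * 100 bytes

estimates = {
    "sequential memory": sequential_seconds(size_mb, "memory"),
    "sequential network": sequential_seconds(size_mb, "network"),
    "sequential disk": sequential_seconds(size_mb, "disk"),
    "random disk (one seek per record)": random_disk_seconds(records),
}
for pattern, seconds in estimates.items():
    print(f"{pattern}: {seconds:g} s")
```

With these numbers, streaming the data over the network (about 1 s) beats streaming it from local disk (about 3 s), and one seek per record takes hours (10,000 s) on the same disk.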
The solution
more often, these days, it is better to have a
distributed system to crunch the data
one node gathers partial results from many other
nodes and produces final result
individual nodes are in turn optimized to return
results quickly from their partial local data
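The gather step can be sketched on a single machine, with threads standing in for nodes. The function names and the choice of a mean as the final result are assumptions for illustration; a real system would send the partial results over the network.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(partition):
    """Work done on one node: summarize only the local slice."""
    return sum(partition), len(partition)


def gather_mean(partitions):
    """Coordinator: fan the work out, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(partial_stats, partitions))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(gather_mean(partitions))  # 5.0
```

Each worker returns a small (sum, count) pair instead of its raw data, so the coordinator only merges a handful of numbers.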
big-data tech
‘big data’ technologies & tools help you define
your task in a higher order language, abstracting
out the details of distributed systems underneath
big-data tech
i.e., they encourage/force you to follow a certain
access pattern
big-data tech
these are still tools
following suggested best-practices helps make the
best use of the tool
at the same time, following anti-patterns can make
the situation appear more challenging
big-data tech: DSLs
some of the most popular frameworks happen to
be DSLs that generate another set of
instructions that actually execute
pig, hive, scalding, cascading, etc.
big-data tech: DSLs
biggest challenge with these DSLs is getting
them to work, and long term maintenance
big-data tech: DSLs
when using DSLs, code is written in one
language and executed in another
if anything fails, the error message is usually
associated with the latter and you have to know
enough about the abstracted layer to be able to
translate it back to your code
big-data tech: DSLs
but it usually works for the most part, because..
some of these popular frameworks have active
user-groups and blog posts that’ll help you get
the job done
big-data tech: DSLs
so, the more popular the technology, the better the
documentation & support
also, the bigger the community, the better the likelihood
that the technology keeps evolving
so start with the most popular/common technology that
does the job, even if it means giving up some ‘cool’
feature provided by another, currently less popular tool,
unless that feature is absolutely necessary
big-data tech: map/reduce
most of the current (as of Feb 2015) ‘big-data’
frameworks revolve around the map/reduce
paradigm
… and use hadoop technologies underneath
big-data tech: map/reduce
hadoop technologies for big data processing can
be seen as 2 major components
hdfs => distributed filesystem for storage
‘map/reduce’ => programming model
big-data tech: hdfs
hdfs stores data in an immutable format
data is ‘partitioned’ & stored on different
machines
there are also multiple copies of each ‘part’, i.e., it is
replicated
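A toy sketch of partitioning plus replication. The function name, the round-robin placement and the node names are assumptions for illustration only; HDFS itself uses a rack-aware placement policy rather than this scheme.

```python
def place_blocks(data, block_size, nodes, replicas=3):
    """Split `data` into fixed-size blocks and assign each to `replicas` distinct nodes."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placement = {}
    for idx, block in enumerate(blocks):
        # Illustrative round-robin placement, not the real HDFS policy.
        holders = [nodes[(idx + r) % len(nodes)] for r in range(replicas)]
        placement[idx] = (block, holders)
    return placement


layout = place_blocks(b"abcdefghij", block_size=4, nodes=["n1", "n2", "n3", "n4"])
for idx, (block, holders) in layout.items():
    print(idx, block, holders)
# 0 b'abcd' ['n1', 'n2', 'n3']
# 1 b'efgh' ['n2', 'n3', 'n4']
# 2 b'ij' ['n3', 'n4', 'n1']
```

Losing any single node still leaves two copies of every block, and each block can be processed on whichever holder is free.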
big-data tech: hdfs
‘partitioning’ enables faster processing in parallel
also helps with data locality: code can run where
data is and not the other way around
big-data tech: hdfs
‘replication’ increases availability of the data
itself
it also helps with overall performance &
availability of tasks running on it (speculative
execution): run the same task on multiple replicas,
and wait for one of them to finish
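The idea as described on the slide can be sketched with threads standing in for replica nodes. The function names, replica names and delays are hypothetical; the sleep simulates a slow node, and this is not how Hadoop schedules backup tasks internally.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_task(replica, delay):
    """Process the block held by `replica`; `delay` simulates a slow node."""
    time.sleep(delay)
    return replica, sum(range(1000))  # every replica computes the same answer


def speculative(replica_delays):
    """Launch the same task on every replica and take the first result."""
    pool = ThreadPoolExecutor(max_workers=len(replica_delays))
    futures = [pool.submit(run_task, r, d) for r, d in replica_delays.items()]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)  # do not block on the stragglers
    return next(iter(done)).result()


winner, value = speculative({"n1": 0.5, "n2": 0.01, "n3": 0.3})
print(winner, value)
```

The result is the same whichever replica wins; the job simply stops waiting for the slow ones.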
big-data tech: map/reduce
‘map/reduce’ is a functional programming concept
map => processes & transforms each set of data
points in parallel and outputs (key, value) pairs
reduce => gathers those partial results and comes
up with the final result
big-data tech: map/reduce
‘map/reduce’ in hadoop also has one more
important step between map and reduce
shuffle: makes sure all values for a given ‘key’
end up on the same reducer
big-data tech: map/reduce
‘map/reduce’ in hadoop also has a slightly
different ‘reduce’
reduce: values are aggregated per key, not
across the whole dataset
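The three steps can be sketched in plain Python as a word count. The function names and the two-reducer setup are assumptions for illustration; this runs in one process and only mimics the data flow, not Hadoop's API.

```python
from collections import defaultdict


def map_phase(line):
    """map: turn one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs, num_reducers):
    """shuffle: route each pair so all values for a key meet on one reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[hash(key) % num_reducers][key].append(value)
    return buckets


def reduce_phase(bucket):
    """reduce: aggregate the values of each key independently."""
    return {key: sum(values) for key, values in bucket.items()}


lines = ["big data is big", "data is data"]
pairs = [pair for line in lines for pair in map_phase(line)]

counts = {}
for bucket in shuffle(pairs, num_reducers=2):
    counts.update(reduce_phase(bucket))

print(sorted(counts.items()))
# [('big', 2), ('data', 3), ('is', 2)]
```

Which reducer a key lands on may differ between runs, but every key is handled by exactly one reducer, so the merged result is the same.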
big-data tech: map/reduce
in general map/reduce shines for ‘embarrassingly
parallel’ problems
trying to run non-parallelizable jobs on hadoop
(like requiring global ordering, or something
similar) might work now, but may not scale in the
long run
http://en.wikipedia.org/wiki/Embarrassingly_parallel
big-data tech: map/reduce
But surprisingly a *lot* of ‘big-data’ problems can
be modeled directly on map/reduce with little to
no change
big-data tech: map/reduce
map/reduce on hadoop is a multi-tenant
distributed system running on disk-local data
non-interactive analysis
big-data analysis/processing has typically been
associated with non-interactive jobs. i.e., the user
doesn’t expect the results to come back in a few
seconds
the job usually takes a few minutes to a few hours, or
even days
the need for speed
What if we made it run on in-memory data?
the need for speed
this is the current trend
spark, presto are some noteworthy examples
not based on map/reduce programming paradigm
but still take advantage of the underlying
distributed filesystem (hdfs)
faster big-data: spark
Spark is written in Scala and exposes a DSL-style API
Resilient Distributed Datasets (RDDs): primary
abstraction in Spark
RDDs: collections of data kept in-memory
Provides a collections API, comparable to Scala’s,
whose operations transform one RDD into another
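To make the “transform one RDD to another” idea concrete, here is a toy, single-process stand-in. The MiniRDD class is hypothetical; its method names mirror the Spark RDD API (parallelize, map, filter, cache, collect), but nothing here is distributed or taken from Spark's implementation.

```python
class MiniRDD:
    """Toy stand-in for an RDD: a dataset defined by how to compute it."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function returning an iterable
        self._cached = False
        self._data = None

    @classmethod
    def parallelize(cls, items):
        data = list(items)
        return cls(lambda: data)

    def map(self, fn):
        # Returns a new dataset; nothing is computed until collect().
        return MiniRDD(lambda: (fn(x) for x in self.collect()))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self.collect() if pred(x)))

    def cache(self):
        self._cached = True
        return self

    def collect(self):
        if self._data is not None:
            return self._data
        result = list(self._compute())
        if self._cached:
            self._data = result  # keep the result in memory for reuse
        return result


calls = []


def square(x):
    calls.append(x)  # record how often the transformation really runs
    return x * x


squares = MiniRDD.parallelize(range(1, 6)).map(square).cache()
evens = squares.filter(lambda x: x % 2 == 0)

print(evens.collect())    # [4, 16]
print(squares.collect())  # [1, 4, 9, 16, 25]
print(len(calls))         # 5
```

The second collect reuses the cached squares, so the map function runs only five times in total.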
faster big-data: spark
fault-tolerance: retries computation on certain
failures
so it’s also good for the ‘backend’ / scheduled
jobs in addition to interactive usage
faster big-data: spark
where it shines: complex, iterative algorithms like
in Machine learning.
Since RDDs can be cached in-memory and
used across the network, the computation
speeds up considerably in the absence of disk
I/O
faster big-data: presto
presto is from facebook; written in java
essentially a distributed read-only SQL query engine
designed specifically for interactive ad-hoc analysis
over Petabytes of data
data on disk, but processing pipeline is fully in-memory
faster big-data: presto
no fault-tolerance, but extremely fast
ideal for ad-hoc queries on extremely large data
that finish fast but not for long running or
scheduled jobs
as of today there is no UDF support
references
http://www.akkadia.org/drepper/cpumemory.pdf
http://static.googleusercontent.com/media/research.google.com/en/us/people/jeff/stanford-295-talk.pdf
http://queue.acm.org/detail.cfm?id=1563874
http://en.wikipedia.org/wiki/Apache_Hadoop
https://prestodb.io/
Leave questions & comments below, or reach out
through LinkedIn!
Arvind Kalyan
https://www.linkedin.com/in/base16

Weitere ähnliche Inhalte

Was ist angesagt?

Big Data Final Presentation
Big Data Final PresentationBig Data Final Presentation
Big Data Final Presentation17aroumougamh
 
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012Gigaom
 
Introduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopIntroduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopSavvycom Savvycom
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data StackZubair Nabi
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop IntroductionJayant Mukherjee
 
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyGuest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyNishant Gandhi
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopAmir Shaikh
 
Big Data using NoSQL Technologies
Big Data using NoSQL TechnologiesBig Data using NoSQL Technologies
Big Data using NoSQL TechnologiesAmit Singh
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyonddatasalt
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataJoey Li
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureRoman Nikitchenko
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsKamalika Dutta
 
Big Data - A brief introduction
Big Data - A brief introductionBig Data - A brief introduction
Big Data - A brief introductionFrans van Noort
 
Big Data Course - BigData HUB
Big Data Course - BigData HUBBig Data Course - BigData HUB
Big Data Course - BigData HUBAhmed Salman
 
Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics
Big data Big AnalyticsAjay Ohri
 
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in detailsMahmoud Yassin
 
Big data trends challenges opportunities
Big data trends challenges opportunitiesBig data trends challenges opportunities
Big data trends challenges opportunitiesMohammed Guller
 

Was ist angesagt? (20)

Big Data Final Presentation
Big Data Final PresentationBig Data Final Presentation
Big Data Final Presentation
 
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
 
Introduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopIntroduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & Hadoop
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data Stack
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop Introduction
 
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyGuest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
Big Data using NoSQL Technologies
Big Data using NoSQL TechnologiesBig Data using NoSQL Technologies
Big Data using NoSQL Technologies
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyond
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time Systems
 
Big Data - A brief introduction
Big Data - A brief introductionBig Data - A brief introduction
Big Data - A brief introduction
 
Bigdata
Bigdata Bigdata
Bigdata
 
Big Data Course - BigData HUB
Big Data Course - BigData HUBBig Data Course - BigData HUB
Big Data Course - BigData HUB
 
Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics
Big data Big Analytics
 
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in details
 
Big Data: an introduction
Big Data: an introductionBig Data: an introduction
Big Data: an introduction
 
Big data trends challenges opportunities
Big data trends challenges opportunitiesBig data trends challenges opportunities
Big data trends challenges opportunities
 

Andere mochten auch

Hadoop Demo eConvergence
Hadoop Demo eConvergenceHadoop Demo eConvergence
Hadoop Demo eConvergencekvnnrao
 
Big data – a brief overview
Big data – a brief overviewBig data – a brief overview
Big data – a brief overviewDorai Thodla
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony NguyenThanh Nguyen
 
BigData Overview
BigData OverviewBigData Overview
BigData OverviewHoryun Lee
 
Overview of Big Data, Data Science and Statistics, along with Digitalisation,...
Overview of Big Data, Data Science and Statistics, along with Digitalisation,...Overview of Big Data, Data Science and Statistics, along with Digitalisation,...
Overview of Big Data, Data Science and Statistics, along with Digitalisation,...Prof. Dr. Diego Kuonen
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Mahantesh Angadi
 
BigData - Hadoop -by 侯圣文@secooler
BigData - Hadoop -by 侯圣文@secooler BigData - Hadoop -by 侯圣文@secooler
BigData - Hadoop -by 侯圣文@secooler Shengwen HOU(侯圣文)
 
Intro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataIntro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataPaco Nathan
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsDavid Pittman
 
Titan: The Rise of Big Graph Data
Titan: The Rise of Big Graph DataTitan: The Rise of Big Graph Data
Titan: The Rise of Big Graph DataMarko Rodriguez
 
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data ScientistDaniel Tunkelang
 
Titan: Big Graph Data with Cassandra
Titan: Big Graph Data with CassandraTitan: Big Graph Data with Cassandra
Titan: Big Graph Data with CassandraMatthias Broecheler
 
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)Prof. Dr. Diego Kuonen
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Data Science London
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 

Andere mochten auch (20)

Hadoop Demo eConvergence
Hadoop Demo eConvergenceHadoop Demo eConvergence
Hadoop Demo eConvergence
 
Big data overview
Big data overviewBig data overview
Big data overview
 
Big data – a brief overview
Big data – a brief overviewBig data – a brief overview
Big data – a brief overview
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony Nguyen
 
BigData Overview
BigData OverviewBigData Overview
BigData Overview
 
Overview of Big Data, Data Science and Statistics, along with Digitalisation,...
Overview of Big Data, Data Science and Statistics, along with Digitalisation,...Overview of Big Data, Data Science and Statistics, along with Digitalisation,...
Overview of Big Data, Data Science and Statistics, along with Digitalisation,...
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
 
BigData - Hadoop -by 侯圣文@secooler
BigData - Hadoop -by 侯圣文@secooler BigData - Hadoop -by 侯圣文@secooler
BigData - Hadoop -by 侯圣文@secooler
 
Intro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataIntro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big Data
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data Scientists
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Titan: The Rise of Big Graph Data
Titan: The Rise of Big Graph DataTitan: The Rise of Big Graph Data
Titan: The Rise of Big Graph Data
 
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data Scientist
 
Titan: Big Graph Data with Cassandra
Titan: Big Graph Data with CassandraTitan: Big Graph Data with Cassandra
Titan: Big Graph Data with Cassandra
 
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)
 
Introduction to R for Data Mining
Introduction to R for Data MiningIntroduction to R for Data Mining
Introduction to R for Data Mining
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 

Ähnlich wie Big Data - An Overview

Big Data
Big DataBig Data
Big DataNGDATA
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopMr. Ankit
 
A gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and HadoopA gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and HadoopStefano Paluello
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...Reynold Xin
 
Big Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformBig Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformGeekNightHyderabad
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniquesijsrd.com
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟datastack
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
Databases benoitg 2009-03-10
Databases benoitg 2009-03-10Databases benoitg 2009-03-10
Databases benoitg 2009-03-10benoitg
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation Shivanee garg
 
Big data & Hadoop
Big data & HadoopBig data & Hadoop
Big data & HadoopAhmed Gamil
 
Processing Drone data @Scale
Processing Drone data @ScaleProcessing Drone data @Scale
Processing Drone data @ScaleDr Hajji Hicham
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xNPN Training
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big DataOmnia Safaan
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesJon Meredith
 

Ähnlich wie Big Data - An Overview (20)

Big Data
Big DataBig Data
Big Data
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
A gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and HadoopA gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and Hadoop
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
Big Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformBig Data - Need of Converged Data Platform
Big Data - Need of Converged Data Platform
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniques
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Databases benoitg 2009-03-10
Databases benoitg 2009-03-10Databases benoitg 2009-03-10
Databases benoitg 2009-03-10
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation
 
Big data
Big dataBig data
Big data
 
Big data & Hadoop
Big data & HadoopBig data & Hadoop
Big data & Hadoop
 
The future of Big Data tooling
The future of Big Data toolingThe future of Big Data tooling
The future of Big Data tooling
 
Processing Drone data @Scale
Processing Drone data @ScaleProcessing Drone data @Scale
Processing Drone data @Scale
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
 
Big data business case
Big data   business caseBig data   business case
Big data business case
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big Data
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL Databases
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 

Kürzlich hochgeladen

Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一F La
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 

Big Data - An Overview

  • 1. big data - an overview Arvind Kalyan Engineer at LinkedIn
  • 3. Size? 100 GB of 1980 U.S. Census data was considered ‘big’ at the time and required sophisticated machinery. Source: http://www.columbia.edu/cu/computinghistory/mss.html
  • 4. Size? 7 billion people on earth. Storing name (~20 chars), age (7 bits) and gender (1 bit) for everyone on earth takes 7 billion * (20 * 8 + 7 + 1) / 8 bytes = 147 GB. In the last 25 years that number hasn’t grown much and is still in the same order of magnitude.
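The estimate on slide 4 can be checked with a few lines of arithmetic. This is a minimal sketch that only restates the slide's own assumptions (one byte per character, 7 bits for age, 1 bit for gender); the variable names are illustrative.

```python
# Back-of-the-envelope storage estimate for one small record per person.
PEOPLE = 7_000_000_000   # world population used on the slide
NAME_BITS = 20 * 8       # ~20 characters, assuming one byte each
AGE_BITS = 7             # enough for ages 0..127
GENDER_BITS = 1          # as modelled on the slide

bits_per_person = NAME_BITS + AGE_BITS + GENDER_BITS  # 168 bits
bytes_per_person = bits_per_person // 8               # 21 bytes
total_bytes = PEOPLE * bytes_per_person

print(total_bytes / 10**9, "GB")  # 147.0 GB
```

At 21 bytes per person the whole table would fit on a single commodity machine, which is the point the following slides build on.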
  • 5. Size? i.e., the cardinality of real-world data doesn't change or grow very fast, so that really is not that much data.
  • 6. Size? A server with 128 GB of RAM and multiple TB of disk space is not difficult to find in 2015.
  • 7. Size? But observations on those entities can be far too many. Those 7 billion people interact with other people, web-sites, products, etc. at different points in time and at different locations, and the number of data points quickly explodes.
  • 8. Analysis ‘big-data’ is a challenge when you want to analyze all of those observations to identify trends & patterns.
  • 9. RDBMS for performing analysis RDBMS these days can hold *much* larger volumes on a single machine
  • 10. RDBMS for performing analysis RDBMS is good at storing data & fetching individual rows satisfying your query
  • 11. RDBMS for performing analysis MySQL for example, when used right can guarantee data durability and at the same time provide some of the lowest read-latencies
  • 12. RDBMS for performing analysis RDBMS can also be used for ad-hoc analysis. Performance of such analysis can be improved by sacrificing ‘relational’ principles, for example by denormalizing tables (cost of copy vs seek).
  • 13. RDBMS for performing analysis But this doesn’t scale when you want to run the analysis across all rows for every user. Some queries can end up running for seconds or even days, depending on the size of the tables.
  • 14. RDBMS for performing analysis ... and then consider the overhead of doing this on an on-going basis.
  • 15. RDBMS for performing analysis RDBMS is not the right system if you want to look at & process the full data-set.
  • 16. RDBMS for performing analysis These days data is an asset and businesses & organizations need to extrapolate trends, patterns, etc. from that data
  • 18. “Numbers Everyone Should Know”
      L1 cache reference: 0.5 ns
      Branch mispredict: 5 ns
      L2 cache reference: 7 ns
      Mutex lock/unlock: 100 ns
      Main memory reference: 100 ns
      Compress 1K bytes with Zippy: 10,000 ns
      Send 2K bytes over 1 Gbps network: 20,000 ns
      Read 1 MB sequentially from memory: 250,000 ns
      Round trip within same datacenter: 500,000 ns
      Disk seek: 10,000,000 ns
      Read 1 MB sequentially from network: 10,000,000 ns
      Read 1 MB sequentially from disk: 30,000,000 ns
      Send packet CA->Netherlands->CA: 150,000,000 ns
  • 19. Let’s look closely at disk vs memory Source: http://deliveryimages.acm.org/10.1145/1570000/1563874/jacobs3.jpg
  • 20. Let’s look closely at disk vs memory In general, disk is slower than SSD, and SSD is slower than memory. But more importantly…
  • 21. Let’s look closely at disk vs memory Memory is not always faster than disk, and SSD is not always faster than disk: it depends on whether the access is sequential or random.
  • 22. ... and at network vs disk Network is not always slower than disk. Plus, the local machine and disk have limits, but the network enables you to grow!
  • 23. Lesson learned Access pattern is important. Depending on the data-set (size) & use-case, the access pattern alone can make the difference between possible & not possible!
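To make the lesson concrete, the latency figures from slide 18 can be plugged into a rough estimate. This is a sketch under stated assumptions: the 1 GB data-set, the one-million-record count and the "one disk seek per record" worst case are hypothetical and not taken from the slides.

```python
NS_PER_S = 1_000_000_000

# Figures from the "Numbers Everyone Should Know" slide.
READ_1MB_MEMORY_NS = 250_000
READ_1MB_DISK_NS = 30_000_000
DISK_SEEK_NS = 10_000_000

def sequential_read_seconds(megabytes, ns_per_mb):
    """Time to stream `megabytes` of data at a fixed per-MB cost."""
    return megabytes * ns_per_mb / NS_PER_S

def random_read_seconds(records, seek_ns=DISK_SEEK_NS):
    """Assumed worst case: every record costs one full disk seek."""
    return records * seek_ns / NS_PER_S

data_mb = 1024       # hypothetical 1 GB data-set
records = 1_000_000  # hypothetical: about 1 KB per record

memory_scan = sequential_read_seconds(data_mb, READ_1MB_MEMORY_NS)
disk_scan = sequential_read_seconds(data_mb, READ_1MB_DISK_NS)
disk_random = random_read_seconds(records)

print(f"sequential scan from memory: {memory_scan:.3f} s")
print(f"sequential scan from disk:   {disk_scan:.2f} s")
print(f"one seek per record on disk: {disk_random:.0f} s")
```

Under these assumptions the same data on the same disk takes about half a minute to scan sequentially and close to three hours to read one seek at a time, so the access pattern matters more than the medium.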
  • 24. The solution More often than not these days, it is better to have a distributed system crunch the data. One node gathers partial results from many other nodes and produces the final result. Individual nodes are in turn optimized to return results quickly from their partial, local data.
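The gather step described on slide 24 can be sketched in a few lines. This is only an illustration: threads stand in for separate machines, and the function names `partial_sum` and `distributed_mean` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(partition):
    """Work done by one 'node' on its local slice of the data."""
    return sum(partition), len(partition)

def distributed_mean(partitions):
    """Coordinator: fan out to workers, gather partial results, combine."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(partial_sum, partitions))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

# Each inner list plays the role of the data held by one node.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(distributed_mean(partitions))  # 5.0
```

Each worker returns a small (sum, count) pair rather than its raw data, so the coordinator's work stays cheap no matter how large the partitions are.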
  • 25. big-data tech ‘big data’ technologies & tools help you define your task in a higher order language, abstracting out the details of distributed systems underneath
  • 26. big-data tech i.e., they encourage/force you to follow a certain access pattern
  • 27. big-data tech These are still tools. Following suggested best-practices helps make the best use of the tool; at the same time, following anti-patterns can make the situation appear more challenging.
  • 28. big-data tech: DSLs Some of the most popular frameworks happen to be DSLs that generate another set of instructions that actually execute: pig, hive, scalding, cascading, etc.
  • 29. big-data tech: DSLs biggest challenge with these DSLs is getting them to work, and long term maintenance
  • 30. big-data tech: DSLs When using DSLs, code is written in one language and executed in another. If anything fails, the error message is usually associated with the latter, and you have to know enough about the abstracted layer to be able to translate it back to your code.
  • 31. big-data tech: DSLs but it usually works for the most part, because.. some of these popular frameworks have active user-groups and blog posts that’ll help you get the job done
  • 32. big-data tech: DSLs So, the more popular the technology, the better the documentation & support. Also, the bigger the community, the better the likelihood of the technology evolving. So start with the most popular/common technology that does the job, even if it means giving up some ‘cool’ feature provided by another, currently less popular tool, unless that feature is absolutely necessary.
  • 33. big-data tech: map/reduce most of the current (as of Feb 2015) ‘big-data’ frameworks revolve around the map/reduce paradigm … and use hadoop technologies underneath
  • 34. big-data tech: map/reduce Hadoop technologies for big data processing can be seen as 2 major components: hdfs => distributed filesystem for storage; ‘map/reduce’ => programming model.
  • 35. big-data tech: hdfs hdfs stores data in an immutable format. Data is ‘partitioned’ & stored on different machines. There are also multiple copies of each ‘part’, i.e., it is replicated.
  • 36. big-data tech: hdfs ‘Partitioning’ enables faster processing in parallel. It also helps with data locality: code can run where the data is, and not the other way around.
  • 37. big-data tech: hdfs ‘Replication’ increases availability of the data itself. It also helps with overall performance & availability of tasks running on it (speculative execution): run the same task on multiple replicas, and wait for one of them to finish.
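Partitioning and replication can be illustrated with a toy placement scheme. This is a simplified sketch, not how hdfs actually chooses nodes: the round-robin policy, the node names and the helper functions are all assumptions made for illustration.

```python
def split_into_blocks(data, block_size):
    """Partition a sequence into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes, replication=3):
    """Hypothetical round-robin placement: each block goes to
    `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a live node."""
    return [
        block for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    ]

blocks = split_into_blocks(list(range(10)), block_size=4)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"])
print(placement[0])                             # ['n1', 'n2', 'n3']
print(readable_blocks(placement, {"n1", "n2"})) # [0, 1, 2]
```

With three replicas per block, losing two of the four nodes still leaves every block readable, and any of the surviving holders can run a task against its local copy.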
  • 38. big-data tech: map/reduce ‘map/reduce’ is a functional programming concept. map => process & transform each set of data points in parallel and output (key, value) pairs. reduce => gather those partial results and come up with the final result.
  • 39. big-data tech: map/reduce ‘map/reduce’ in hadoop also has one more important step between map and reduce. shuffle: makes sure all values for a given ‘key’ end up on the same reducer.
  • 40. big-data tech: map/reduce ‘map/reduce’ in hadoop also has a slightly different ‘reduce’. reduce: values are aggregated per key, not across the whole dataset.
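The three phases described on slides 38 to 40 can be sketched with a word count running in a single process. This is an in-memory illustration of the programming model only, not the hadoop API; the function names are illustrative.

```python
from collections import defaultdict

def map_phase(line):
    """map: turn one input record into (key, value) pairs."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """shuffle: group all values for the same key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """reduce: aggregate the values of a single key."""
    return key, sum(values)

def word_count(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

print(word_count(["big data", "big clusters", "data data"]))
# {'big': 2, 'data': 3, 'clusters': 1}
```

Because `reduce_phase` only ever sees the values of one key, different keys can be reduced on different machines without coordination, which is what the per-key reduce on slide 40 makes possible.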
  • 41. big-data tech: map/reduce In general, map/reduce shines for ‘embarrassingly parallel’ problems. Trying to run non-parallelizable jobs on hadoop (like those requiring global ordering, or something similar) might work now, but may not scale in the long run. http://en.wikipedia.org/wiki/Embarrassingly_parallel
  • 42. big-data tech: map/reduce But surprisingly a *lot* of ‘big-data’ problems can be modeled directly on map/reduce with little to no change
  • 43. big-data tech: map/reduce map/reduce on hadoop is a multi-tenant distributed system running on disk-local data
  • 44. non-interactive analysis Big-data analysis/processing has typically been associated with non-interactive jobs, i.e., the user doesn’t expect the results to come back in a few seconds. The job usually takes a few minutes to a few hours, or even days.
  • 45. the need for speed What if we made it run on in-memory data?
  • 46. the need for speed This is the current trend. spark and presto are some noteworthy examples: not based on the map/reduce programming paradigm, but they still take advantage of the underlying distributed filesystem (hdfs).
  • 47. faster big-data: spark Spark is a Scala DSL. Resilient Distributed Datasets (RDDs) are the primary abstraction in Spark: collections of data kept in-memory. Spark provides a collections API, comparable to that of the Scala language, whose operations transform one RDD into another.
  • 48. faster big-data: spark Fault-tolerance: retries computation on certain failures, so it’s also good for ‘backend’ / scheduled jobs in addition to interactive usage.
  • 49. faster big-data: spark Where it shines: complex, iterative algorithms like those in machine learning. Since RDDs can be cached in-memory and used across the network, the computation speeds up considerably in the absence of disk I/O.
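Why in-memory caching helps iterative work can be shown with a toy, single-process stand-in for an RDD. This is explicitly not the Spark API: the `LazyDataset` class and its methods are hypothetical and only mimic the idea of lazy transformations plus a cache.

```python
class LazyDataset:
    """Toy stand-in for an RDD: remembers how to compute itself."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument function returning a list
        self._cache_enabled = False
        self._cached = None
        self.computations = 0        # how often the data was recomputed

    @classmethod
    def from_list(cls, items):
        return cls(lambda: list(items))

    def map(self, fn):
        # Lazy: nothing runs until collect() is called on the result.
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return LazyDataset(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        self._cache_enabled = True
        return self

    def collect(self):
        if self._cache_enabled and self._cached is not None:
            return self._cached
        self.computations += 1
        result = self._compute()
        if self._cache_enabled:
            self._cached = result
        return result

squares = LazyDataset.from_list(range(5)).map(lambda x: x * x).cache()
for _ in range(3):  # an 'iterative algorithm' reusing the same data
    total = sum(squares.filter(lambda x: x % 2 == 0).collect())

print(total, squares.computations)  # 20 1
```

Without the `cache()` call the squares would be recomputed on every iteration; with it they are computed once and reused, which is the effect slide 49 attributes to keeping RDDs in memory.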
  • 50. faster big-data: presto presto is from facebook and is written in java. It is essentially a distributed, read-only SQL query engine, designed specifically for interactive ad-hoc analysis over petabytes of data. The data is on disk, but the processing pipeline is fully in-memory.
  • 51. faster big-data: presto No fault-tolerance, but extremely fast. Ideal for ad-hoc queries on extremely large data that finish fast, but not for long-running or scheduled jobs. As of today (Feb 2015) there is no UDF support.
  • 53. Leave questions & comments below, or reach out through LinkedIn! Arvind Kalyan https://www.linkedin.com/in/base16