An introduction to
Big Data processing
using Hadoop
A.Sedighi
hexican.com
No single standard definition…
“Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…
Big Data, Definition
Information is powerful…
but it is how we use it that will
define us
Data Explosion
relational
text
audio
video
images
Big Data Era
-creates over 30 billion pieces of content per day
-stores 30 petabytes of data
-produces over 90 million tweets per day
Log Files
-Log files contain data.
-Each banking transaction should be logged at
different levels.
How much log data does a banking solution generate
per day?
Big Data: 3 V's
volume
velocity
variety
Some make it 3 V's
What Is Driving the Big Data Industry?
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
Big Data Challenges
Sorting 10 TB on:
1 node takes 2.5 days O(N log N)
100 nodes take 35 minutes O(log N)
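Taking the two sorting figures above at face value, a quick calculation shows how close the cluster comes to linear scaling. This is only a sketch: the function names are illustrative, and the inputs are the rounded numbers from the slide, not measurements.

```python
MINUTES_PER_DAY = 24 * 60


def speedup(single_node_minutes, cluster_minutes):
    """How many times faster the cluster finishes than a single node."""
    return single_node_minutes / cluster_minutes


def parallel_efficiency(speedup_factor, nodes):
    """Speedup per node; 1.0 would mean perfectly linear scaling."""
    return speedup_factor / nodes


# Figures quoted on the slide: 2.5 days on 1 node, 35 minutes on 100 nodes.
single_node = 2.5 * MINUTES_PER_DAY
factor = speedup(single_node, 35)
print(f"speedup: {factor:.1f}x")
print(f"efficiency: {parallel_efficiency(factor, 100):.2f}")
```

The speedup comes out at about 103x on 100 nodes, which is roughly linear. The slight excess over 100 most likely reflects rounding in the quoted times.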
Big Data Challenges
Problem: “Fat” servers imply high cost.
Solution: Use cheap commodity nodes instead.
Problem: A large number of cheap nodes implies frequent
failures.
Solution: Leverage automatic fault tolerance.
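Why many cheap nodes mean frequent failures can be illustrated with a small probability sketch. It assumes nodes fail independently and uses a hypothetical per-node failure probability of 0.1% per day; neither assumption comes from the slides.

```python
def prob_any_failure(nodes, p_fail):
    """Probability that at least one of `nodes` machines fails,
    assuming each fails independently with probability `p_fail`."""
    return 1 - (1 - p_fail) ** nodes


# Hypothetical per-node daily failure probability (an assumption).
P_FAIL = 0.001

for n in (1, 100, 1000):
    print(f"{n:>5} nodes: {prob_any_failure(n, P_FAIL):.1%} chance of a failure per day")
```

Under these assumptions a single node almost never fails on a given day, while a 1000-node cluster sees at least one failure on most days, so recovery has to be automatic rather than manual.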
Big Data Challenges
We need a new data-parallel programming
model for clusters of commodity machines.
What Technology Do We Have For Big Data?
Map Reduce
MapReduce
Published in 2004 by Google.
Popularized by the Apache Hadoop project.
Used by Yahoo!, Facebook, Twitter, Amazon, LinkedIn and many other enterprises.
Word Count Example
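The word count idea can be sketched in plain Python, with no Hadoop involved. The three functions below mimic the map, shuffle, and reduce phases inside a single process; the function names and the sample input are illustrative assumptions, not Hadoop's API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


sample = ["deer bear river", "car car river", "deer car bear"]
print(word_count(sample))
```

In a real cluster the map calls run in parallel on different input splits and the shuffle moves data across the network, but the programmer still only writes the map and reduce functions.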
MapReduce philosophy
-hide complexity
-make it scalable
-make it cheap
MapReduce popularized by
Apache Hadoop project
Hadoop Overview
Open source implementation of Google's
MapReduce and
Google File System (GFS)
First release in 2008 by Yahoo!
Wide adoption by Facebook, Twitter, Amazon, etc.
Everything Started By Searching
Hadoop was created by
Doug Cutting, the creator
of Apache Lucene, the
widely used text search
library. Hadoop has its
origins in Apache Nutch,
an open source web
search engine, itself a part
of the Lucene project.
Hadoop Sub Projects - 1
Hadoop Sub Projects - 2
Hadoop Distributed File System (HDFS) - 1
HDFS is a filesystem designed for storing very large files
with streaming data access patterns, running on clusters
of commodity hardware.
-“Very large” in this context means files that are hundreds
of megabytes, gigabytes, or terabytes in size. There are
Hadoop clusters running today that store petabytes of
data.
Hadoop Distributed File System (HDFS) - 2
HDFS is a filesystem designed for storing very large files with
streaming data access patterns, running on clusters of commodity
hardware.
-HDFS is built around the idea that the most efficient data
processing pattern is a write-once, read-many-times pattern. The
time to read the whole dataset is more important than the latency
in reading the first record.
Hadoop Distributed File System (HDFS) - 3
HDFS is a filesystem designed for storing very large files with
streaming data access patterns, running on clusters of commodity
hardware.
-HDFS is designed to carry on working without a noticeable
interruption to the user in the face of node failure.
Where doesn't HDFS work well?
● Low-latency data access
● Lots of small files
● Multiple writers, arbitrary file modifications.
MapReduce and HDFS
HDFS Concepts - Blocks
64 MB, 128 MB, or 256 MB block size.
If the seek time is around 10ms, and the transfer rate is 100 MB/s,
then to make the seek time 1% of the transfer time, we need to
make the block size around 100 MB.
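The block size reasoning above is simple arithmetic and can be checked with a short sketch. The function and parameter names are illustrative, and the 1 GB file in the usage lines is a hypothetical example.

```python
import math


def target_block_size_mb(seek_time_s, transfer_rate_mb_s, seek_fraction):
    """Block size at which seek time is `seek_fraction` of transfer time."""
    transfer_time_s = seek_time_s / seek_fraction
    return transfer_time_s * transfer_rate_mb_s


def blocks_needed(file_size_mb, block_mb):
    """Number of blocks a file occupies; the last block may be partial."""
    return math.ceil(file_size_mb / block_mb)


# Figures from the slide: 10 ms seek, 100 MB/s transfer, seek = 1% of transfer.
print(target_block_size_mb(0.010, 100, 0.01), "MB")

# Hypothetical example: a 1 GB file stored in 128 MB blocks.
print(blocks_needed(1024, 128), "blocks")
```

With a 10 ms seek, the transfer must last about 1 second for the seek to stay at 1%, and at 100 MB/s that is about 100 MB per block, which is why the common settings sit in the 64 to 256 MB range.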
Anatomy of a File Read
Anatomy of a File Write
Replica Placement
Machine Learning - 1
Mahout's goal is to build scalable machine learning libraries. Its core algorithms for clustering, classification, and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.
Machine Learning - 2
Mahout can be used as a recommender engine on top of Hadoop clusters.
Using Hadoop for
● ads and recommendations
● online travel
● processing mobile data
● energy savings and discovery
● infrastructure management
● image processing
● fraud detection
● IT security
● health care

 

An introduction to Big Data processing using Hadoop

  • 1. An introduction to Big Data processing using Hadoop A.Sedighi hexican.com
  • 2. No single standard definition… “Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it… Big Data, Definition
  • 3. Information is powerful… but it is how we use it that will define us
  • 5. Big Data Era -creates over 30 billion pieces of content per day -stores 30 petabytes of data -produces over 90 million tweets per day
  • 6. Log Files - Log files contain data. - Each banking transaction should be logged at different levels. How many log files does a banking solution generate per day?
  • 8. Big Data: 3 V's volume velocity variety
  • 9. Some make it 3 V's
  • 10. What is driving the Big Data industry? - Optimizations and predictive analytics - Complex statistical analysis - All types of data, and many sources - Very large datasets - More real-time - Ad-hoc querying and reporting - Data mining techniques - Structured data, typical sources - Small to mid-size datasets
  • 12. Big Data Challenges Sorting 10 TB: 1 node takes 2.5 days, O(N log N); 100 nodes take 35 minutes, O(log N)
  • 13. Big Data Challenges Problem: “Fat” servers imply high cost. Solution: Use cheap commodity nodes instead. Problem: A large number of cheap nodes implies frequent failures. Solution: Leverage automatic fault tolerance.
  • 14. Big Data Challenges We need a new data-parallel programming model for clusters of commodity machines.
  • 15. What Technology Do We Have For Big Data?
  • 16.
  • 18. MapReduce Published in 2004 by Google. Popularized by the Apache Hadoop project. Used by Yahoo!, Facebook, Twitter, Amazon, LinkedIn and many other enterprises.
  • 20. MapReduce philosophy -hide complexity -make it scalable -make it cheap
  • 22. Hadoop Overview Open source implementation of Google MapReduce and Google File System (GFS). First release in 2008 by Yahoo! Wide adoption by Facebook, Twitter, Amazon, etc.
  • 23.
  • 24. Everything Started By Searching Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.
  • 25. Hadoop Sub Projects - 1
  • 26. Hadoop Sub Projects - 2
  • 27. Hadoop Distributed File System (HDFS) - 1 HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. - “Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
  • 28. Hadoop Distributed File System (HDFS) - 2 HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. - HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. The time to read the whole dataset is more important than the latency in reading the first record.
  • 29. Hadoop Distributed File System (HDFS) - 3 HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. - HDFS is designed to carry on working without a noticeable interruption to the user in the face of hardware failure.
  • 30. Where doesn't HDFS work well? ● Low-latency data access ● Lots of small files ● Multiple writers, arbitrary file modifications
  • 32. HDFS Concepts - Blocks 64 MB, 128 MB, or 256 MB block size. If the seek time is around 10 ms and the transfer rate is 100 MB/s, then to make the seek time 1% of the transfer time, the block size needs to be around 100 MB.
  • 33. Anatomy of a File Read
  • 34. Anatomy of a File Write
  • 35.
  • 37. Machine Learning - 1 Mahout's goal is to build scalable machine learning libraries. Its core algorithms for clustering, classification, and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.
  • 38. Machine Learning - 2 Mahout can be used as a recommender engine on top of Hadoop clusters.
  • 39. Using Hadoop for ● ads and recommendations ● online travel ● processing mobile data ● energy savings and discovery ● infrastructure management ● image processing ● fraud detection ● IT security ● health care
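
The block-size reasoning on slide 32 (HDFS Concepts - Blocks) can be checked with a short calculation. The sketch below is only an illustration in Python: the function names `block_size_mb` and `seek_overhead` are hypothetical helpers, not part of Hadoop, and the model assumes one seek per block followed by a sequential transfer at a constant rate.

```python
def block_size_mb(seek_time_s: float, transfer_rate_mb_s: float, seek_fraction: float) -> float:
    """Block size (MB) at which one seek costs `seek_fraction` of the transfer time.

    Hypothetical helper for illustration; assumes one seek per block and a
    constant sequential transfer rate.
    """
    # If seek_time = seek_fraction * transfer_time, then:
    transfer_time_s = seek_time_s / seek_fraction
    # Data moved during that transfer time is the block size.
    return transfer_rate_mb_s * transfer_time_s


def seek_overhead(block_mb: float, seek_time_s: float, transfer_rate_mb_s: float) -> float:
    """Ratio of seek time to transfer time for a given block size."""
    transfer_time_s = block_mb / transfer_rate_mb_s
    return seek_time_s / transfer_time_s


if __name__ == "__main__":
    # Numbers taken from the slide: 10 ms seek, 100 MB/s transfer, 1% target.
    size = block_size_mb(seek_time_s=0.010, transfer_rate_mb_s=100.0, seek_fraction=0.01)
    print(f"target block size: {size:.0f} MB")

    # Seek overhead for the block sizes listed on the slide.
    for candidate in (64, 128, 256):
        ratio = seek_overhead(candidate, seek_time_s=0.010, transfer_rate_mb_s=100.0)
        print(f"{candidate} MB block: seek is {ratio:.2%} of transfer time")
```

Under these assumptions the target works out to about 100 MB, which is why block sizes in the 64 MB to 256 MB range keep the seek cost near or below the 1% mark.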