SlideShare ist ein Scribd-Unternehmen logo
1 von 60
‫و‬ ‫چرا‬ ،‫داده‬ ‫کالن‬ ‫عصر‬
‫چگونه؟‬
VAHID AMIRI
VAHIDAMIRY.IR
VAHID.AMIRY@GMAIL.COM
Big DataData Data Processing
Data Gathering
Data Storing
Big Data Definition
 No single standard definition…
“Big Data” is data whose scale, diversity, and complexity
require new architecture, techniques, algorithms, and
analytics to manage it and extract value and hidden
knowledge from it…
Big Data: 3V’s
12+ TBs
of tweet data
every day
25+ TBs of
log data
every day
?TBsof
dataeveryday
2+
billion
people on
the Web
by end
2011
30 billion RFID
tags today
(1.3B in 2005)
4.6
billion
camera
phones
world wide
100s of
millions
of GPS
enabled
devices
sold
annually
76 million smart
meters in 2009…
200M by 2014
Volume
Variety (Complexity)
 Relational Data (Tables/Transaction/Legacy Data)
 Text Data (Web)
 Semi-structured Data (XML)
 Graph Data
 Social Network, Semantic Web (RDF), …
 Streaming Data
 You can only scan the data once
 Big Public Data (online, weather, finance, etc)
To extract knowledge all these types of
data need to linked together
A Single View to the Customer
Customer
Social
Media
Gaming
Entertain
Bankin
g
Financ
e
Our
Known
History
Purchase
Velocity (Speed)
 Data is begin generated fast and need to be processed fast
 Online Data Analytics
 Late decisions  missing opportunities
Social media and networks
(all of us are generating data)
Mobile devices
(tracking all objects all the time)
Sensor technology and
networks
(measuring all kinds of data)
Some Make it 4V’s
 The Model of Generating/Consuming Data has Changed
The Model Has Changed…
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
Solution
Big
Data
Big
Comput
ation
Big
Computer
Big Data Solutions
 Hadoop is a software framework for distributed processing of large datasets
across large clusters of computers
 Hadoop implements Google’s MapReduce, using HDFS
 MapReduce divides applications into many small blocks of work.
 HDFS creates multiple replicas of data blocks for reliability, placing them on compute
nodes around the cluster
Hadoop
Spark Stack
 More than just the Elephant in the room
 Over 120+ types of NoSQL databases
So many NoSQL options
 Extend the Scope of RDBMS
 Caching
 Master/Slave
 Table Partitioning
 Federated Tables
 Sharding
NoSql
 Relational database (RDBMS) technology
 Has not fundamentally changed in over 40 years
 Default choice for holding data behind many web apps
 Handling more users means adding a bigger server
RDBMS with Extended Functionality
Vs.
Systems Built from Scratch
with Scalability in Mind
NoSQL Movement
CAP Theorem
 “Of three properties of shared-data systems – data Consistency, system
Availability and tolerance to network Partition – only two can be achieved at
any given moment in time.”
“Of three properties of shared-data systems – data
Consistency, system Availability and tolerance to
network Partition – only two can be achieved at any
given moment in time.”
 CA
 Highly-available consistency
 CP
 Enforced consistency
 AP
 Eventual consistency
CAP Theorem
Flavors of NoSQL
 Schema-less
 State (Persistent or Volatile)
 Example:
 Redis
 Amazon DynamoDB
Key / Value Database
 Wide, sparse column sets
 Schema-light
 Examples:
 Cassandra
 HBase
 BigTable
 GAE HR DS
Column Database
 Use for data that is
 document-oriented (collection of JSON documents) w/semi structured
data
 Encodings include XML, YAML, JSON & BSON
 binary forms
 PDF, Microsoft Office documents -- Word, Excel…)
 Examples: MongoDB, CouchDB
Document Database
Graph Database
Use for data with
 a lot of many-to-many relationships
 when your primary objective is quickly
finding connections, patterns and
relationships between the objects within
lots of data
 Examples: Neo4J, FreeBase (Google)
So which type of NoSQL? Back to CAP…
CP = noSQL/column
Hadoop
Big Table
HBase
MemCacheDB
AP = noSQL/document or key/value
DynamoDB
CouchDB
Cassandra
Voldemort
CA = SQL/RDBMS
SQL Sever / SQL
Azure
Oracle
MySQL
Apache Hadoop Projects
Apache Hadoop
 A framework for storing & processing Petabyte of data using commodity hardware
and storage
 Apache project
 Implemented in Java
 Community of contributors is growing
 Yahoo: HDFS and MapReduce
 Powerset: HBase
 Facebook: Hive and FairShare scheduler
 IBM: Eclipse plugins
Briefing history of Hadoop
Organization used hadoop
Hadoop System Principles
 Scale-Out rather than Scale-Up
 Bring code to data rather than data to code
 Deal with failures – they are common
 Abstract complexity of distributed and concurrent applications
Scale-Out Instead of Scale-Up
 It is harder and more expensive to scale-up
 Add additional resources to an existing node (CPU, RAM)
 New units must be purchased if required resources can not be added
 Also known as scale vertically
 Scale-Out
 Add more nodes/machines to an existing distributed application
 Software Layer is designed for node additions or removal
 Hadoop takes this approach - A set of nodes are bonded together as a single
distributed system
 Very easy to scale down as well
Code to Data
 Traditional data processing architecture
 Nodes are broken up into separate processing and storage nodes connected by
high-capacity link
 Many data-intensive applications are not CPU demanding causing bottlenecks in
network
Code to Data
 Hadoop co-locates processors and storage
 Code is moved to data (size is tiny, usually in KBs)
 Processors execute code and access underlying local storage
Failures are Common
 Given a large number machines, failures are common
 Large warehouses may see machine failures weekly or even daily
 Hadoop is designed to cope with node failures
 Data is replicated
 Tasks are retried
Abstract Complexity
 Hadoop abstracts many complexities in distributed and concurrent applications
 Defines small number of components
 Provides simple and well defined interfaces of interactions between these components
 Frees developer from worrying about system level challenges
 processing pipelines, data partitioning, code distribution
 Allows developers to focus on application development and business logic
Distribution Vendors
 Cloudera Distribution for Hadoop (CDH)
 MapR Distribution
 Hortonworks Data Platform (HDP)
 Apache BigTop Distribution
Components
 Distributed File System
 HDFS
 Distributed Processing Framework
 Map/Reduce
The Storage:
Hadoop Distributed File System
HDFS is Good for...
 Storing large files
 Terabytes, Petabytes, etc...
 Millions rather than billions of files
 100MB or more per file
 Streaming data
 Write once and read-many times patterns
 Optimized for streaming reads rather than random reads
 “Cheap” Commodity Hardware
 No need for super-computers, use less reliable commodity hardware
HDFS Daemons
Files and Blocks
HDFS Component Communication
REPLICA MANGEMENT
 A common practice is to spread the nodes across multiple racks
 A good replica placement policy should improve data reliability, availability,
and network bandwidth utilization
 Namenode determines replica placement
NETWORK TOPOLOGY AND HADOOP
The Execution Engine:
Apache Yarn
Apache Yarn
Yarn Components
 RescourceManager:
 Arbitrates resources among all the applications in the
system
 NodeManager:
 the per-machine slave, which is responsible for launching
the applications’ containers, monitoring their resource
usage
 ApplicationMaster:
 Negotiate appropriate resource containers from the
Scheduler, tracking their status and monitoring for progress
 Container:
 Unit of allocation incorporating resource elements such as
memory, cpu, disk, network etc., to execute a specific task of the
application (similar to map/reduce slots in MRv1)
YARN Architecture
The Processing Model:
MapReduce
Hadoop Mapreduce Framework
What is MapReduce?
 Parallel programming model for large clusters
 User implements Map() and Reduce()
 Parallel computing framework
 Libraries take care of EVERYTHING else
 Parallelization
 Fault Tolerance
 Data Distribution
 Load Balancing
 MapReduce library does most of the hard work for us!
 Takes care of distributed processing and coordination
 Scheduling
 Task Localization with Data
 Error Handling
 Data Synchronization
MapReduce: Data Flow
Map and Reduce
 Map()
 Map workers read in contents of corresponding input partition
 Process a key/value pair to generate intermediate key/value pairs
 Reduce()
 Merge all intermediate values associated with the same key
 eg. <key, [value1, value2,..., valueN]>
 Output of user's reduce function is written to output file on global file system
 When all tasks have completed, master wakes up user program
Distributed Processing
 Word count on a huge file
Mapreduce Model
Example: Counting Words
عصر کلان داده، چرا و چگونه؟

Weitere ähnliche Inhalte

Was ist angesagt?

Big Data Unit 4 - Hadoop
Big Data Unit 4 - HadoopBig Data Unit 4 - Hadoop
Big Data Unit 4 - HadoopRojaT4
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingCloudera, Inc.
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemMd. Hasan Basri (Angel)
 
Big data technology unit 3
Big data technology unit 3Big data technology unit 3
Big data technology unit 3RojaT4
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystemnallagangus
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5RojaT4
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringBADR
 
Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiridatastack
 
Presentation on Hadoop Technology
Presentation on Hadoop TechnologyPresentation on Hadoop Technology
Presentation on Hadoop TechnologyOpenDev
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache HadoopAjit Koti
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemInSemble
 
Comparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheComparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheSandeepTaksande
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 

Was ist angesagt? (20)

Big Data Unit 4 - Hadoop
Big Data Unit 4 - HadoopBig Data Unit 4 - Hadoop
Big Data Unit 4 - Hadoop
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data Processing
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-System
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big data technology unit 3
Big data technology unit 3Big data technology unit 3
Big data technology unit 3
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystem
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data Engineering
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiri
 
Presentation on Hadoop Technology
Presentation on Hadoop TechnologyPresentation on Hadoop Technology
Presentation on Hadoop Technology
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
 
Hadoop
HadoopHadoop
Hadoop
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 
Comparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheComparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs Apache
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 

Andere mochten auch

فناوری‌های حوزه‌ی کلان داده - Introduction to Big Data Technologies
 فناوری‌های حوزه‌ی کلان داده - Introduction to Big Data Technologies فناوری‌های حوزه‌ی کلان داده - Introduction to Big Data Technologies
فناوری‌های حوزه‌ی کلان داده - Introduction to Big Data TechnologiesEhsan Asgarian
 
Internet of Things Security Challlenges
Internet of Things Security ChalllengesInternet of Things Security Challlenges
Internet of Things Security Challlengesquickheal_co_ir
 
تشخیص انجمن در مقیاس کلان داده
تشخیص انجمن در مقیاس کلان دادهتشخیص انجمن در مقیاس کلان داده
تشخیص انجمن در مقیاس کلان دادهNavid Sedighpour
 
Big Data and select suitable tools
Big Data and select suitable toolsBig Data and select suitable tools
Big Data and select suitable toolsMeghdad Hatami
 
A Story of Big Data:Introduction
A Story of Big Data:IntroductionA Story of Big Data:Introduction
A Story of Big Data:IntroductionMobin Ranjbar
 
Big data بزرگ داده ها
Big data بزرگ داده هاBig data بزرگ داده ها
Big data بزرگ داده هاOmid Sohrabi
 
کلان داده کاربردها و چالش های آن
کلان داده کاربردها و چالش های آنکلان داده کاربردها و چالش های آن
کلان داده کاربردها و چالش های آنHamed Azizi
 
داده های عظیم چگونه دنیا را تغییر خواهند داد
داده های عظیم چگونه دنیا را تغییر خواهند داد داده های عظیم چگونه دنیا را تغییر خواهند داد
داده های عظیم چگونه دنیا را تغییر خواهند داد Farzad Khandan
 
اینترنت اشیا در 10 دقیقه
اینترنت اشیا در 10 دقیقهاینترنت اشیا در 10 دقیقه
اینترنت اشیا در 10 دقیقهMahmood Neshati (PhD)
 
Big Data and Machine Learning Workshop - Day 3 @ UTACM
Big Data and Machine Learning Workshop - Day 3 @ UTACMBig Data and Machine Learning Workshop - Day 3 @ UTACM
Big Data and Machine Learning Workshop - Day 3 @ UTACMAmir Sedighi
 
Two Case Studies Big-Data and Machine Learning at Scale Solutions in Iran
Two Case Studies Big-Data and Machine Learning at Scale Solutions in IranTwo Case Studies Big-Data and Machine Learning at Scale Solutions in Iran
Two Case Studies Big-Data and Machine Learning at Scale Solutions in IranAmir Sedighi
 
Hardware Provisioning
Hardware Provisioning Hardware Provisioning
Hardware Provisioning MongoDB
 
Learning spark ch04 - Working with Key/Value Pairs
Learning spark ch04 - Working with Key/Value PairsLearning spark ch04 - Working with Key/Value Pairs
Learning spark ch04 - Working with Key/Value Pairsphanleson
 
3Com 69-001212-00
3Com 69-001212-003Com 69-001212-00
3Com 69-001212-00savomir
 
Tarea taller 4 mapa de conceptos isamalia muñiz
Tarea taller 4 mapa de conceptos   isamalia muñizTarea taller 4 mapa de conceptos   isamalia muñiz
Tarea taller 4 mapa de conceptos isamalia muñizIsamalia Muniz
 

Andere mochten auch (20)

فناوری‌های حوزه‌ی کلان داده - Introduction to Big Data Technologies
 فناوری‌های حوزه‌ی کلان داده - Introduction to Big Data Technologies فناوری‌های حوزه‌ی کلان داده - Introduction to Big Data Technologies
فناوری‌های حوزه‌ی کلان داده - Introduction to Big Data Technologies
 
Internet of Things Security Challlenges
Internet of Things Security ChalllengesInternet of Things Security Challlenges
Internet of Things Security Challlenges
 
تشخیص انجمن در مقیاس کلان داده
تشخیص انجمن در مقیاس کلان دادهتشخیص انجمن در مقیاس کلان داده
تشخیص انجمن در مقیاس کلان داده
 
داده های جریانی streaming data
داده های جریانی streaming dataداده های جریانی streaming data
داده های جریانی streaming data
 
Big Data and select suitable tools
Big Data and select suitable toolsBig Data and select suitable tools
Big Data and select suitable tools
 
A Story of Big Data:Introduction
A Story of Big Data:IntroductionA Story of Big Data:Introduction
A Story of Big Data:Introduction
 
Big data بزرگ داده ها
Big data بزرگ داده هاBig data بزرگ داده ها
Big data بزرگ داده ها
 
کلان داده کاربردها و چالش های آن
کلان داده کاربردها و چالش های آنکلان داده کاربردها و چالش های آن
کلان داده کاربردها و چالش های آن
 
داده های عظیم چگونه دنیا را تغییر خواهند داد
داده های عظیم چگونه دنیا را تغییر خواهند داد داده های عظیم چگونه دنیا را تغییر خواهند داد
داده های عظیم چگونه دنیا را تغییر خواهند داد
 
بیگ دیتا
بیگ دیتابیگ دیتا
بیگ دیتا
 
اینترنت اشیا در 10 دقیقه
اینترنت اشیا در 10 دقیقهاینترنت اشیا در 10 دقیقه
اینترنت اشیا در 10 دقیقه
 
Emilio aparicio
Emilio aparicioEmilio aparicio
Emilio aparicio
 
Big Data and Machine Learning Workshop - Day 3 @ UTACM
Big Data and Machine Learning Workshop - Day 3 @ UTACMBig Data and Machine Learning Workshop - Day 3 @ UTACM
Big Data and Machine Learning Workshop - Day 3 @ UTACM
 
Two Case Studies Big-Data and Machine Learning at Scale Solutions in Iran
Two Case Studies Big-Data and Machine Learning at Scale Solutions in IranTwo Case Studies Big-Data and Machine Learning at Scale Solutions in Iran
Two Case Studies Big-Data and Machine Learning at Scale Solutions in Iran
 
Gifted education
Gifted educationGifted education
Gifted education
 
Family Circle Presentation
Family Circle PresentationFamily Circle Presentation
Family Circle Presentation
 
Hardware Provisioning
Hardware Provisioning Hardware Provisioning
Hardware Provisioning
 
Learning spark ch04 - Working with Key/Value Pairs
Learning spark ch04 - Working with Key/Value PairsLearning spark ch04 - Working with Key/Value Pairs
Learning spark ch04 - Working with Key/Value Pairs
 
3Com 69-001212-00
3Com 69-001212-003Com 69-001212-00
3Com 69-001212-00
 
Tarea taller 4 mapa de conceptos isamalia muñiz
Tarea taller 4 mapa de conceptos   isamalia muñizTarea taller 4 mapa de conceptos   isamalia muñiz
Tarea taller 4 mapa de conceptos isamalia muñiz
 

Ähnlich wie عصر کلان داده، چرا و چگونه؟

Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopMr. Ankit
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with HadoopNalini Mehta
 
1.demystifying big data & hadoop
1.demystifying big data & hadoop1.demystifying big data & hadoop
1.demystifying big data & hadoopdatabloginfo
 
How can Hadoop & SAP be integrated
How can Hadoop & SAP be integratedHow can Hadoop & SAP be integrated
How can Hadoop & SAP be integratedDouglas Bernardini
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xNPN Training
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache HadoopChristopher Pezza
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Hadoop - A big data initiative
Hadoop - A big data initiativeHadoop - A big data initiative
Hadoop - A big data initiativeMansi Mehra
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big pictureJ S Jodha
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookAmr Awadallah
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoopVarun Narang
 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoopShashwat Shriparv
 
Big Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. SparkBig Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. SparkGraisy Biswal
 

Ähnlich wie عصر کلان داده، چرا و چگونه؟ (20)

Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Hadoop
HadoopHadoop
Hadoop
 
Big Data , Big Problem?
Big Data , Big Problem?Big Data , Big Problem?
Big Data , Big Problem?
 
1.demystifying big data & hadoop
1.demystifying big data & hadoop1.demystifying big data & hadoop
1.demystifying big data & hadoop
 
How can Hadoop & SAP be integrated
How can Hadoop & SAP be integratedHow can Hadoop & SAP be integrated
How can Hadoop & SAP be integrated
 
The future of Big Data tooling
The future of Big Data toolingThe future of Big Data tooling
The future of Big Data tooling
 
hadoop
hadoophadoop
hadoop
 
hadoop
hadoophadoop
hadoop
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Hadoop - A big data initiative
Hadoop - A big data initiativeHadoop - A big data initiative
Hadoop - A big data initiative
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoop
 
Big Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. SparkBig Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. Spark
 
HADOOP
HADOOPHADOOP
HADOOP
 

Kürzlich hochgeladen

RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 

Kürzlich hochgeladen (20)

RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 

عصر کلان داده، چرا و چگونه؟

  • 1. ‫و‬ ‫چرا‬ ،‫داده‬ ‫کالن‬ ‫عصر‬ ‫چگونه؟‬ VAHID AMIRI VAHIDAMIRY.IR VAHID.AMIRY@GMAIL.COM
  • 2. Big DataData Data Processing Data Gathering Data Storing
  • 3.
  • 4. Big Data Definition  No single standard definition… “Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…
  • 6. 12+ TBs of tweet data every day 25+ TBs of log data every day ?TBsof dataeveryday 2+ billion people on the Web by end 2011 30 billion RFID tags today (1.3B in 2005) 4.6 billion camera phones world wide 100s of millions of GPS enabled devices sold annually 76 million smart meters in 2009… 200M by 2014 Volume
  • 7. Variety (Complexity)  Relational Data (Tables/Transaction/Legacy Data)  Text Data (Web)  Semi-structured Data (XML)  Graph Data  Social Network, Semantic Web (RDF), …  Streaming Data  You can only scan the data once  Big Public Data (online, weather, finance, etc) To extract knowledge all these types of data need to linked together
  • 8. A Single View to the Customer Customer Social Media Gaming Entertain Bankin g Financ e Our Known History Purchase
  • 9. Velocity (Speed)  Data is begin generated fast and need to be processed fast  Online Data Analytics  Late decisions  missing opportunities Social media and networks (all of us are generating data) Mobile devices (tracking all objects all the time) Sensor technology and networks (measuring all kinds of data)
  • 10. Some Make it 4V’s
  • 11.  The Model of Generating/Consuming Data has Changed The Model Has Changed… Old Model: Few companies are generating data, all others are consuming data New Model: all of us are generating data, and all of us are consuming data
  • 14.  Hadoop is a software framework for distributed processing of large datasets across large clusters of computers  Hadoop implements Google’s MapReduce, using HDFS  MapReduce divides applications into many small blocks of work.  HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster Hadoop
  • 15.
  • 17.  More than just the Elephant in the room  Over 120+ types of NoSQL databases So many NoSQL options
  • 18.  Extend the Scope of RDBMS  Caching  Master/Slave  Table Partitioning  Federated Tables  Sharding NoSql  Relational database (RDBMS) technology  Has not fundamentally changed in over 40 years  Default choice for holding data behind many web apps  Handling more users means adding a bigger server
  • 19. RDBMS with Extended Functionality Vs. Systems Built from Scratch with Scalability in Mind NoSQL Movement
  • 20. CAP Theorem  “Of three properties of shared-data systems – data Consistency, system Availability and tolerance to network Partition – only two can be achieved at any given moment in time.”
  • 21. “Of three properties of shared-data systems – data Consistency, system Availability and tolerance to network Partition – only two can be achieved at any given moment in time.”  CA  Highly-available consistency  CP  Enforced consistency  AP  Eventual consistency CAP Theorem
  • 23.  Schema-less  State (Persistent or Volatile)  Example:  Redis  Amazon DynamoDB Key / Value Database
  • 24.  Wide, sparse column sets  Schema-light  Examples:  Cassandra  HBase  BigTable  GAE HR DS Column Database
  • 25.  Use for data that is  document-oriented (collection of JSON documents) w/semi structured data  Encodings include XML, YAML, JSON & BSON  binary forms  PDF, Microsoft Office documents -- Word, Excel…)  Examples: MongoDB, CouchDB Document Database
  • 26. Graph Database Use for data with  a lot of many-to-many relationships  when your primary objective is quickly finding connections, patterns and relationships between the objects within lots of data  Examples: Neo4J, FreeBase (Google)
  • 27. So which type of NoSQL? Back to CAP… CP = noSQL/column Hadoop Big Table HBase MemCacheDB AP = noSQL/document or key/value DynamoDB CouchDB Cassandra Voldemort CA = SQL/RDBMS SQL Sever / SQL Azure Oracle MySQL
  • 28.
  • 30. Apache Hadoop  A framework for storing & processing Petabyte of data using commodity hardware and storage  Apache project  Implemented in Java  Community of contributors is growing  Yahoo: HDFS and MapReduce  Powerset: HBase  Facebook: Hive and FairShare scheduler  IBM: Eclipse plugins
  • 33. Hadoop System Principles  Scale-Out rather than Scale-Up  Bring code to data rather than data to code  Deal with failures – they are common  Abstract complexity of distributed and concurrent applications
  • 34. Scale-Out Instead of Scale-Up  It is harder and more expensive to scale-up  Add additional resources to an existing node (CPU, RAM)  New units must be purchased if required resources can not be added  Also known as scale vertically  Scale-Out  Add more nodes/machines to an existing distributed application  Software Layer is designed for node additions or removal  Hadoop takes this approach - A set of nodes are bonded together as a single distributed system  Very easy to scale down as well
  • 35. Code to Data  Traditional data processing architecture  Nodes are broken up into separate processing and storage nodes connected by high-capacity link  Many data-intensive applications are not CPU demanding causing bottlenecks in network
  • 36. Code to Data  Hadoop co-locates processors and storage  Code is moved to data (size is tiny, usually in KBs)  Processors execute code and access underlying local storage
  • 37. Failures are Common  Given a large number machines, failures are common  Large warehouses may see machine failures weekly or even daily  Hadoop is designed to cope with node failures  Data is replicated  Tasks are retried
  • 38. Abstract Complexity  Hadoop abstracts many complexities in distributed and concurrent applications  Defines small number of components  Provides simple and well defined interfaces of interactions between these components  Frees developer from worrying about system level challenges  processing pipelines, data partitioning, code distribution  Allows developers to focus on application development and business logic
  • 39. Distribution Vendors  Cloudera Distribution for Hadoop (CDH)  MapR Distribution  Hortonworks Data Platform (HDP)  Apache BigTop Distribution
  • 40. Components  Distributed File System  HDFS  Distributed Processing Framework  Map/Reduce
  • 42. HDFS is Good for...  Storing large files  Terabytes, Petabytes, etc...  Millions rather than billions of files  100MB or more per file  Streaming data  Write once and read-many times patterns  Optimized for streaming reads rather than random reads  “Cheap” Commodity Hardware  No need for super-computers, use less reliable commodity hardware
  • 46. REPLICA MANGEMENT  A common practice is to spread the nodes across multiple racks  A good replica placement policy should improve data reliability, availability, and network bandwidth utilization  Namenode determines replica placement
  • 50. Yarn Components  RescourceManager:  Arbitrates resources among all the applications in the system  NodeManager:  the per-machine slave, which is responsible for launching the applications’ containers, monitoring their resource usage  ApplicationMaster:  Negotiate appropriate resource containers from the Scheduler, tracking their status and monitoring for progress  Container:  Unit of allocation incorporating resource elements such as memory, cpu, disk, network etc., to execute a specific task of the application (similar to map/reduce slots in MRv1)
  • 54. What is MapReduce?  Parallel programming model for large clusters  User implements Map() and Reduce()  Parallel computing framework  Libraries take care of EVERYTHING else  Parallelization  Fault Tolerance  Data Distribution  Load Balancing  MapReduce library does most of the hard work for us!  Takes care of distributed processing and coordination  Scheduling  Task Localization with Data  Error Handling  Data Synchronization
  • 56. Map and Reduce  Map()  Map workers read in contents of corresponding input partition  Process a key/value pair to generate intermediate key/value pairs  Reduce()  Merge all intermediate values associated with the same key  eg. <key, [value1, value2,..., valueN]>  Output of user's reduce function is written to output file on global file system  When all tasks have completed, master wakes up user program
  • 57. Distributed Processing  Word count on a huge file