Big Data and Hadoop Overview
What is Big Data?
Big data is a term that describes the large volume of data, both structured and unstructured, that
inundates a business on a day-to-day basis. In short, big data is so large and complex that
traditional data management tools cannot store or process it efficiently.
Who Generates Big Data?
More and more data is being produced by a growing number of electronic devices around us
and on the internet. The volume of this data and the rate at which it is produced are so large that
it is referred to as “big data”.
Why Is Big Data Important?
The importance of big data doesn’t revolve around how much data you have, but what you do with it.
You can take data from any source and analyze it to find answers that enable 1) cost reductions, 2) time
reductions, 3) new product development and optimized offerings, and 4) smart decision making. When
you combine big data with high-powered analytics, you can accomplish business-related tasks such as:
• Determining root causes of failures, issues and defects in near-real time.
• Generating coupons at the point of sale based on the customer’s buying habits.
• Recalculating entire risk portfolios in minutes.
• Detecting fraudulent behaviour before it affects your organization.
Brief History of Big Data
While the term “big data” is relatively new, the act of gathering and storing large amounts of
information for eventual analysis is ages old. The concept gained momentum in the early 2000s when
industry analyst Doug Laney articulated the now-mainstream definition of big data as the three Vs:
Volume: Organizations collect data from a variety of sources, including business transactions, social
media and information from sensor or machine-to-machine data. In the past, storing it would’ve been a
problem, but new technologies have eased the burden.
Velocity: Data streams in at an unprecedented speed and must be dealt with in a timely manner. RFID
tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time.
Variety: Data comes in all types of formats, from structured, numeric data in traditional databases to
unstructured text documents, email, video, audio, stock ticker data and financial transactions.
More recently, two further dimensions have been added:
Variability: In addition to the increasing velocities and varieties of data, data flows can be highly
inconsistent with periodic peaks. Is something trending in social media? Daily, seasonal and
event-triggered peak data loads can be challenging to manage.
Complexity: Today's data comes from multiple sources, which makes it difficult to link, match, cleanse
and transform data across systems. However, it’s necessary to connect and correlate relationships,
hierarchies and multiple data linkages or your data can quickly spiral out of control.
Categories of 'Big Data'
Big data can be found in three forms:
1. Structured
2. Unstructured
3. Semi-structured
Structured: Data stored in a relational database management system is one example of
structured data.
Unstructured: Free-form output, such as the results returned by a Google search.
Semi-structured: Personal data stored in an XML file.
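
To make the three categories concrete, the following Python sketch holds the same facts about one person in each form. The record, its field names and its values are invented purely for illustration; they do not come from any real dataset.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Structured: a fixed schema enforced by a relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO person VALUES (1, 'Asha', 'Pune')")
structured_row = conn.execute(
    "SELECT name, city FROM person WHERE id = 1"
).fetchone()

# Semi-structured: self-describing tags, but no rigid schema.
# An extra <hobby> element can appear without changing any table definition.
xml_doc = (
    "<person id='1'><name>Asha</name><city>Pune</city>"
    "<hobby>chess</hobby></person>"
)
root = ET.fromstring(xml_doc)
semi_structured = {child.tag: child.text for child in root}

# Unstructured: free text; any structure has to be inferred by analysis.
unstructured = "Asha lives in Pune and enjoys playing chess at weekends."
```

The structured form can be queried directly by column, the semi-structured form has to be parsed before its fields are usable, and the unstructured form carries no field boundaries at all.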
Evolution of Hadoop
As the World Wide Web grew in the late 1990s and early 2000s, search engines and indexes were
created to help locate relevant information amid the text-based content. In the early years, search
results were returned by humans. But as the web grew from dozens to millions of pages, automation
was needed. Web crawlers were created, many as university-led research projects, and search engine
start-ups took off (Yahoo, AltaVista, etc.).
One such project was an open-source web search engine called Nutch – the brainchild of Doug Cutting
and Mike Cafarella. They wanted to return web search results faster by distributing data and
calculations across different computers so multiple tasks could be accomplished simultaneously. During
this time, another search engine project called Google was in progress. It was based on the same
concept – storing and processing data in a distributed, automated way so that relevant web search
results could be returned faster.
In 2006, Cutting joined Yahoo and took with him the Nutch project as well as ideas based on Google’s
early work with automating distributed data storage and processing. The Nutch project was divided –
the web crawler portion remained as Nutch and the distributed computing and processing portion
became Hadoop. In 2008, Yahoo released Hadoop as an open-source project. Today, Hadoop’s
framework and ecosystem of technologies are managed and maintained by the non-profit Apache
Software Foundation (ASF), a global community of software developers and contributors.
Fun fact: “Hadoop” was the name of a yellow toy elephant owned by the son of one of its inventors.
Why is Hadoop important?
• Ability to store and process huge amounts of any kind of data, quickly. With data volumes and
varieties constantly increasing, especially from social media and the Internet of Things (IoT),
that's a key consideration.
• Computing power. Hadoop's distributed computing model processes big data fast. The more
computing nodes you use, the more processing power you have.
• Fault tolerance. Data and application processing are protected against hardware failure. If a
node goes down, jobs are automatically redirected to other nodes to make sure the distributed
computing does not fail. Multiple copies of all data are stored automatically.
• Flexibility. Unlike traditional relational databases, you don’t have to pre-process data before
storing it. You can store as much data as you want and decide how to use it later. That includes
unstructured data like text, images and videos.
• Low cost. The open-source framework is free and uses commodity hardware to store large
quantities of data.
• Scalability. You can easily grow your system to handle more data simply by adding nodes. Little
administration is required.
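
The fault-tolerance point above can be illustrated with a small Python sketch that splits a file into blocks and stores several copies of each block on different nodes. This is a toy model under stated assumptions: the function names, node names and the round-robin placement are hypothetical, and it does not reproduce the placement policy HDFS actually uses.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Split a file into fixed-size blocks and assign each block to
    `replication` distinct nodes using simple round-robin placement
    (an illustrative assumption, not the real HDFS policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def readable_after_failure(placement, failed_node):
    """Return True if every block still has a replica on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


# Hypothetical four-node cluster storing a 300-unit file in 128-unit blocks.
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_blocks(file_size=300, block_size=128, nodes=nodes)
```

Because every block lives on three different nodes in this sketch, losing any single node still leaves at least two copies of each block, which is why a job can be redirected to another node and continue.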
What are the key components of Hadoop?
The three core components of the Hadoop framework are:
• MapReduce: A software programming model for processing large sets of data in parallel.
• HDFS: The Java-based distributed file system that can store all kinds of data without prior
organization.
• YARN: A resource management framework for scheduling and handling resource requests
from distributed applications.
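
The MapReduce programming model listed above can be sketched in a few lines of plain Python using the classic word-count example. This runs in a single process and only imitates the map, shuffle and reduce steps; the function names are illustrative and are not part of the Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key, as the framework
    would do between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


# Each line stands in for an input split that a separate mapper could handle.
lines = ["big data is big", "hadoop stores big data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

The map calls are independent of one another, and so are the reduce calls for different keys, which is what allows a real cluster to spread them across many nodes.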
Types of Hadoop installation
Hadoop can be run in several ways. The modes in which it can be downloaded, installed and
run are described below.
Standalone mode
Although Hadoop is a distributed platform for working with big data, it can also be installed on a
single node as a standalone instance. In this mode the entire Hadoop platform runs as a single
Java process. It is mostly used for debugging: you can check
your MapReduce applications on a single node before running them on a large Hadoop cluster.
Fully Distributed mode
This is the distributed mode, in which several nodes of commodity hardware are connected to form the Hadoop
cluster. In such a setup the NameNode, JobTracker and Secondary NameNode run on master nodes,
whereas the DataNode and TaskTracker daemons run on each worker (slave) node. In Hadoop 2 and
later, YARN's ResourceManager and NodeManager take over the roles of the JobTracker and TaskTracker.
Pseudo distributed mode
This is in effect a single-node setup that simulates an entire Hadoop cluster. The various daemons,
such as the NameNode, DataNode, JobTracker and TaskTracker, each run as a separate Java process
on the same machine.
Hadoop ecosystem
• HBase: A scalable, distributed database that supports structured data storage for large tables.
• Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
• Mahout: A scalable machine learning and data mining library.
• Pig: A high-level data-flow language and execution framework for parallel computation.
• Flume: A distributed, reliable and available service for efficiently collecting, aggregating and
moving large amounts of log data.
• Oozie: A workflow scheduler system to manage Apache Hadoop jobs.
• Sqoop: A tool designed for efficiently transferring bulk data between Apache Hadoop and
structured data stores such as relational databases.
• ZooKeeper: An effort to develop and maintain an open-source server that enables highly
reliable distributed coordination.
• Oracle R Connector for Hadoop (ORCH): Provides access to a Hadoop cluster from R,
enabling manipulation of HDFS-resident data and the execution of MapReduce jobs.
