Big Data and Hadoop
Presented By: Maulik Lakhani
Session 1 – Introduction to Big Data and Hadoop
 About Me
 Outline
Introduction to Big Data
Traditional Data Processing
Introduction to Hadoop
Hadoop Architecture
Hadoop and RDBMS
Hadoop Distributions
 What is Big Data?
Collection of large datasets that cannot be processed using traditional
computing techniques.
Not a single technique or a tool, rather a complete subject. Various tools,
techniques and frameworks.
It’s not the amount of data that’s important. It’s what organisations do with
the data that matters.
Big data can be analysed for insights that lead to better decisions and
strategic business moves.
 Traditional Data Processing
For storage, programmers rely on a database vendor of their choice, such as Oracle or IBM.
Traditionally, an enterprise uses a single computer to store and process big data.
The user interacts with the application, which in turn handles the part of data
storage and analysis.
 Characteristics - Vs of Big Data
• Volume – Terabytes of data, transactions, files
• Velocity – Periodic, near-time, real-time
• Variety – Structured, unstructured, semi-structured
• Veracity – Quality of the data
 Characteristics - Vs of Big Data
Volume - Size of data
• Researchers have predicted that 40 Zettabytes (40,000 Exabytes) will be
generated by 2020, which is an increase of 300 times from 2005.
Velocity
• There are 1.03 billion Daily Active Users (Facebook) on Mobile as of now, which is
an increase of 22% year-over-year.
 Characteristics - Vs of Big Data
Variety
• Data can be structured, semi-structured or unstructured in the form of images,
audios, videos, sensor data.
• Variety of data creates problems in capturing, storage, mining and analyzing the
data.
Veracity – data uncertainty, data inconsistency and incompleteness
• Due to uncertainty of data, 1 in 3 business leaders don’t trust the information they
use to make decisions.
 Characteristics - Vs of Big Data
Veracity
• Poor data quality costs the US economy around $3.1 trillion a year.
Value – the benefit the organization derives from the data
• Is the organization working on Big Data achieving a high ROI?
 Ms of Big Data
Make Me More Money
 What Comes Under Big Data?
• Black Box Data: Voices of the flight crew, performance information of the aircraft.
• Social Media Data: Posts and views by millions of people across the globe.
• Stock Exchange Data: information about the ‘buy’ and ‘sell’ decisions made on the
stock exchange.
• Power Grid Data: Electricity consumed by a particular node with respect to a base
station.
• Transport Data: Shipping / freight data.
 Examples of Big Data
• Walmart handles more than 1 million customer transactions every hour.
• Facebook stores, accesses, and analyzes 30+ Petabytes of user generated data.
• 230+ million tweets are created every day.
• YouTube users upload 48 hours of new video every minute of the day.
• Amazon handles 15 million customer click stream user data per day to recommend
products.
• 294 billion emails are sent every day. Email services analyze this data to filter out spam.
• Modern cars have close to 100 sensors.
 Applications of Big Data
• Smarter Healthcare (EHR): Predict the patient’s deteriorating condition in advance.
• Telecom: Reduce data packet loss. Offer personalized plans.
• Retail: Recommendation engines - Suggestion based on the browsing history of the
consumer.
• Traffic control: managing traffic better via effective use of data and sensors.
• Manufacturing: reduce component defects, improve product quality, increase
efficiency, and save time and money.
• Search Quality: Personalised search results based on previous searches.
AlphaGo is a narrow AI developed by DeepMind (an Alphabet company) to play the board game Go.
AlphaGo's algorithm uses Monte Carlo tree search to find its moves, guided by previously
learned knowledge.
It gains knowledge through machine learning, specifically an artificial neural network trained
extensively on both human and computer play.
In March 2016, it beat professional Go player Lee Sedol in a five-game match, the first time a
computer program defeated a top-ranked professional.
Google Photos
Google implements different forms of machine learning into the Photos service,
particularly in recognition of photo contents.
People: The Photos app collects all the photos containing faces. It doesn't identify these
people, but simply groups them for quick access.
Places: This feature relies on landmarks. It can correctly identify well-known places like the
Taj Mahal.
Things: This feature aggregates pictures of things like flowers, cars, sky, birthdays, and
cats. There are many more categories, including screenshots, posters, and castles.
Not everything in the garden is rosy!
 Challenges with Big Data
• Data Quality: messy, inconsistent, and incomplete data. Dirty data costs US companies
an estimated $600 billion each year.
• Discovery: Analyzing petabytes of data with powerful algorithms to find patterns and
insights is very difficult.
• Storage: The more data an organization has, the more complex the problem of
managing it.
• Lack of Talent: Developers, data scientists, and analysts who also have sufficient
domain knowledge are scarce.
 Summary of Big Data
Big Data describes collections of data that are huge in size and growing exponentially
with time.
Examples of Big Data generation include stock exchanges, social media sites, jet
engines, etc.
Big Data can be 1) structured, 2) unstructured, or 3) semi-structured.
Volume, Variety, Velocity, and Veracity are some of the characteristics of Big Data.
 Introduction to Hadoop
Hadoop is an open source framework from Apache.
Used to store, process, and analyze very large volumes of data.
Hadoop is written in Java and is not OLAP (online analytical
processing).
Used for batch/offline processing.
 Modules of Hadoop
Hadoop Distributed File System (HDFS): Files are broken into blocks and stored on
nodes across the distributed architecture.
Yet Another Resource Negotiator (YARN): Used for job scheduling and cluster
management.
MapReduce: A framework that lets programs perform parallel computation on data
using key-value pairs.
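The HDFS block-splitting idea above can be sketched in plain Python. This is a toy model, not the Hadoop API; the 128 MB block size is the HDFS 2.x default (older versions used 64 MB), and the helper name is made up for illustration:

```python
# Illustrative default; HDFS 2.x+ uses 128 MB blocks (64 MB in older versions).
BLOCK_SIZE = 128 * 1024 * 1024

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return (offset, length) descriptors for each block of a file.
    Every block is full-sized except possibly the last one."""
    blocks = []
    offset = 0
    while offset < file_size_bytes:
        length = min(block_size, file_size_bytes - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file becomes three blocks: 128 MB, 128 MB, and a final 44 MB block.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))                      # 3
print(blocks[-1][1] // (1024 * 1024))   # 44
```

Each of these blocks is then stored (and replicated) on DataNodes across the cluster.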
 Modules of Hadoop
MapReduce: The Map task takes input data and converts it into a data set that can be
computed as key-value pairs. The output of the Map task is consumed by the Reduce
task, and the output of the reducer gives the desired result.
Hadoop Common: These Java libraries are used to start Hadoop and are used by the
other Hadoop modules.
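The map-then-reduce flow can be simulated in plain Python with the classic word-count example. This is a sketch of the programming model only, not the Hadoop Java API; the function names are illustrative:

```python
from collections import defaultdict

def map_phase(lines):
    """Map task: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce task: sum the counts for each key. In real Hadoop, the
    framework shuffles and groups pairs by key between the two phases."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["Hadoop stores big data", "Hadoop processes big data"]
result = reduce_phase(map_phase(lines))
print(result["hadoop"], result["big"])  # 2 2
```

The same two-phase structure scales out in Hadoop because many Map tasks and many Reduce tasks can run on different nodes in parallel.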
 Hadoop Architecture
NameNode
• Stores metadata for the files, like the directory structure of a typical FS.
• The server holding the NameNode instance is quite crucial, as there is only one.
• Maintains a transaction log for file deletes, adds, etc. Transactions cover only
metadata, never whole blocks or file streams.
• Handles creation of more replica blocks when necessary after a DataNode
failure.
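The metadata-only transaction log can be sketched as a toy model in Python. The class and operation names are made up for illustration; a real NameNode persists its edit log to disk and checkpoints it into an image file:

```python
# Toy sketch of a NameNode-style metadata store with a transaction log.
# Only metadata (paths, block ids) is logged -- never the block contents.

class NameNodeModel:
    def __init__(self):
        self.files = {}       # path -> list of block ids
        self.edit_log = []    # append-only log of metadata operations

    def create_file(self, path, block_ids):
        self.files[path] = list(block_ids)
        self.edit_log.append(("ADD", path, tuple(block_ids)))

    def delete_file(self, path):
        self.files.pop(path, None)
        self.edit_log.append(("DELETE", path))

nn = NameNodeModel()
nn.create_file("/data/log.txt", [101, 102])
nn.delete_file("/data/log.txt")
print(len(nn.edit_log))  # 2
```

Replaying such a log from the beginning reconstructs the namespace, which is why losing the single NameNode is so costly.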
 Hadoop Architecture
DataNode
• Stores the actual data in HDFS
• Can run on any underlying filesystem (ext3/4, NTFS, etc)
• Notifies NameNode of what blocks it has
• NameNode replicates blocks 2x in local rack, 1x elsewhere
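The placement rule on this slide (two replicas in the writer's local rack, one in a different rack) can be sketched as follows. This is a simplified model, and the rack and node names are invented for illustration:

```python
import random

def place_replicas(local_rack, racks):
    """Toy model of the placement policy above: two replicas in the
    writer's local rack, one replica in a different rack.
    `racks` maps rack name -> list of node names."""
    local_nodes = random.sample(racks[local_rack], 2)
    remote_rack = random.choice([r for r in racks if r != local_rack])
    remote_node = random.choice(racks[remote_rack])
    return local_nodes + [remote_node]

racks = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5"],
}
replicas = place_replicas("rack1", racks)
print(replicas)  # e.g. ['node2', 'node1', 'node5']
```

Keeping two replicas in one rack makes writes cheap, while the off-rack copy survives the loss of an entire rack switch.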
 Hadoop Architecture
Job Tracker
The role of the Job Tracker is to accept MapReduce jobs from clients and process the data
using the NameNode. In response, the NameNode provides metadata to the Job Tracker.
Task Tracker
It works as a slave node for the Job Tracker. It receives tasks and code from the Job
Tracker and applies that code to the file. This process is also called a Mapper.
 Hadoop vs RDBMS
Until recently many applications utilized RDBMS for batch processing – Oracle,
Sybase, MySQL, Microsoft SQL Server, etc.
Hadoop doesn’t fully replace relational products; many architectures would
benefit from both Hadoop and a Relational product(s).
 Hadoop vs RDBMS
RDBMS products scale up
• Expensive to scale for larger installations
• Hits a ceiling when storage reaches 100s of terabytes
Hadoop clusters can scale out to 100s of machines and petabytes of storage.
 Comparison to RDBMS
Hadoop was not designed for real-time or low latency queries
Products that do provide low latency queries such as HBase have limited query
functionality
Hadoop performs best for offline batch processing on large amounts of data
RDBMS is best for online transactions and low-latency queries
Hadoop is designed to stream large files and large amounts of data
RDBMS works best with small records
 Hadoop Distributions
Let’s say you go download Hadoop’s HDFS and MapReduce.
At first it works great, but then you decide to start using HBase.
No problem, just download HBase and point it to your existing HDFS.
But you find that HBase only works with an older version of HDFS.
So you downgrade HDFS, and this version juggling repeats with every new component you add.
 Hadoop Distributions
Hadoop Distributions aim to resolve version incompatibilities.
A distribution vendor will do the following:
1. Integration-test a set of Hadoop products.
2. Provide additional scripts to execute Hadoop.
3. Package Hadoop products in various installation formats
(Linux packages, tarballs).
 Hadoop Distributions
 Distribution Vendors
Cloudera Hadoop Distribution
MapR Hadoop Distribution
Amazon Web Services Elastic MapReduce Hadoop Distribution
Hortonworks Hadoop Distribution
IBM Infosphere BigInsights Hadoop Distribution
Microsoft Azure's HDInsight Cloud based Hadoop Distribution
 Cloudera Distribution for Hadoop
Most popular distribution
Cloudera has taken the lead on providing Hadoop Distribution
CDH is provided in various formats:
Linux packages, virtual machine images, and tarballs.
Integrates HDFS, MapReduce, HBase, Hive, Mahout, Oozie, Pig,
Sqoop, Whirr, ZooKeeper, Flume.
 References
• McAfee, A. (2012). Big Data: The Management Revolution. Harvard Business Review.
• Zettaset. (2010). What is Big Data and Why Do Organizations Need It? Retrieved from Zettaset
Corporate: http://www.zettaset.com/index.php/info-center/what-is-big-data
 Q & A