Big Data Management
on
Apache Hadoop
- Naresh Chintalcheru
Agenda
■ What is Big Data
■ Big Inspiration: Google GFS & BigTable
■ Hadoop's Big Architecture
■ Big Files: Apache HDFS and HBase
■ Big Frameworks: MapReduce and YARN
■ Big Queries: Hive and Pig Latin
■ Big Pipes: Flume and Sqoop
■ Big Integration: Hadoop & BI Tools (SAP Business Objects, IBM Cognos)
■ Future of Hadoop: Batch to Real-time
What is Big Data ?
Big Data is a collection of data sets so large and complex that it
becomes difficult to process using traditional database
management tools.
-Wikipedia
● Large: data sets measured in terabytes and petabytes
● Complex: many different data types and formats
● Difficult to process: traditional database tools struggle, and the alternatives
are expensive, proprietary solutions
What is Big Data ?
Is Big Data all about size ?
Big Data V-V-V-V
Big data is explained using 4 V's
● Volume
● Velocity
● Variety
● Variability
Big Volume
Storage capacity over the years ...
● 3 1/2 inch floppy disk: max capacity 1.44 MB
● CD: max capacity 700 MB (Music)
● DVD: capacity 4.7 GB to 17 GB (Movies)
● Blu-ray Disc: 25 GB (HD, 3D Movies)
● iPod Classic: 160 GB
● 3 TB hard drive for $130 on amazon.com
Big Volume
Imagine your own personal life ...
● A couple of decades ago: postal mail from friends, paper household bills and
printed family pictures
● Today most of that communication has been replaced by Facebook messages, Tweets,
SMS texts and emails (themselves fading away)
● Pictures are uploaded to Facebook, Flickr or Picasa
● How many bills do you pay online ? You can look up online how much you paid for
the same service last year
Big Velocity
Exponential growth of corporate and personal data
● Personal data
○ More music, more movies and more online transactions
● Data processed by Facebook (infoq.com)
○ 2 PB in 2009
○ 20 PB in 2010
○ 60 PB in 2011
○ 100 PB in 2012
● Every sixty seconds ... (dzone.com)
○ 694,445 Google searches
○ 6,600+ pictures uploaded to Flickr
○ 98,000 tweets
○ 600 videos uploaded to YouTube
○ 13,000 iPhone apps downloaded
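To put the Facebook figures above in perspective, the year-over-year growth factor can be worked out directly. The following is a minimal Python sketch that uses only the numbers quoted on this slide; the variable names are illustrative.

```python
# Data processed by Facebook per year, in petabytes (figures quoted above).
facebook_pb = {2009: 2, 2010: 20, 2011: 60, 2012: 100}

years = sorted(facebook_pb)
# Growth factor of each year relative to the year before it.
growth = {
    year: facebook_pb[year] / facebook_pb[prev]
    for prev, year in zip(years, years[1:])
}

for year, factor in growth.items():
    print(f"{year}: {factor:.1f}x the previous year")

# "Every sixty seconds" figure converted to a per-second rate.
google_searches_per_minute = 694_445
searches_per_second = google_searches_per_minute / 60
print(f"about {searches_per_second:,.0f} Google searches per second")
```

On these figures the factor falls from 10x to under 2x, so the absolute volume keeps climbing even while the relative growth rate slows.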
Big Variety
The variety of data can be just as challenging: combining relational data with
unstructured data such as text, images and video adds complexity to storing,
processing and querying it.
Traditional Data: text data, emails, documents, stock records, finances, personal files
Big Data: pictures, images, audio, video, 3D models, location sensor data
Big Variability
Data formats change continuously ...
● It took years for traditional RDBMS products to add an XML column type
● Native JSON column types arrived later still (PostgreSQL 9.2, MySQL 5.7)
● Many more new formats to come
Dealing with variability in traditional databases is a very slow process
Problem with RDBMS
● An RDBMS, or traditional database, deals with structured data
● By a commonly cited estimate, 20% of corporate data is structured and
80% is unstructured
● A predefined database schema and fixed data types make it
harder to adapt to new data formats
● Horizontal scaling of an RDBMS is complex and expensive
Power of Big Data
Big Data
● Deals with unstructured data
● Built on horizontal scaling architecture
Big Data Sources
Data collected from ...
Weblogs, Social Network
Video archives, Photography archives
Mobile Phone data, Sensors
RFID barcodes
Medical records
Atmospheric Science
Personal Finance
Camera surveillance
e-commerce and m-commerce transactions
Big Data Benefits
Create new revenue streams
Insights gained from analyzing your market and its consumers with Big Data can
open up new revenue streams.
Perform effective risk analysis
Predictive analytics, fueled by Big Data, lets you scan and analyze newspaper
reports or social media feeds so that you continuously keep up to speed on the latest
developments in your industry.
Re-design products
Big Data can also help you understand how others perceive your products, so that you
can adapt them or your marketing.
Social Intelligence
Social Intelligence, similar to Business Intelligence, is emerging from social network
websites.
Security Benefits
Web logs are saved and analysed for unusual access behaviours
Big Inspiration
Google released a series of papers on the technology behind its
search product.
● The first paper, on the distributed file system GFS, appeared
in 2003.
● The second paper, on the MapReduce framework, followed in
2004.
● The next paper, on BigTable, followed in 2006.
Big Inspiration
Inspired by the Google papers ....
Doug Cutting, a Yahoo employee at the time, saw the opportunity
and led the charge to develop an open source version of GFS
and Google MapReduce. He named it Hadoop, after his son's toy elephant.
Big Inspiration
Google Products              Apache Hadoop Products
GFS: Google File System      HDFS: Hadoop Distributed File System
GMR: Google MapReduce        MapReduce
BigTable                     HBase
Google Dremel                Apache Drill
Hadoop Architecture
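Unlike traditional databases, Hadoop splits data storage and data
processing into separate layers with their own daemons: NameNode and
DataNodes for storage, JobTracker and TaskTrackers for processing.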
Hadoop Architecture (diagrams)
What is Hadoop ?
A scalable fault-tolerant grid operating system for
data storage and processing.
-Cloudera
HDFS
HDFS: Hadoop Distributed File System
● Self-healing, high-bandwidth clustered storage
● Streams very large files on commodity servers
● Stores data as files
● Divides a single file into multiple blocks
● Fault-tolerant against hardware failures
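The block-splitting and fault-tolerance points above can be illustrated with a small Python sketch. It is a simplification, not HDFS code: the 128 MB block size and the replication factor of 3 are assumed values (both are configurable in HDFS), and the round-robin placement and node names are hypothetical stand-ins for the real placement policy.

```python
from __future__ import annotations

BLOCK_SIZE_MB = 128   # assumed block size (configurable in HDFS)
REPLICATION = 3       # assumed replication factor (configurable in HDFS)


def split_into_blocks(file_size_mb: int, block_size_mb: int = BLOCK_SIZE_MB) -> list[int]:
    """Return the size of each block; only the last block may be smaller."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(num_blocks: int, nodes: list[str],
                   replication: int = REPLICATION) -> dict[int, list[str]]:
    """Toy round-robin placement: each block lands on `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


def readable_after_failure(placement: dict[int, list[str]], dead: set[str]) -> bool:
    """A file stays readable while every block has at least one live replica."""
    return all(any(n not in dead for n in replicas) for replicas in placement.values())


blocks = split_into_blocks(1000)   # a hypothetical 1000 MB file
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas(len(blocks), nodes)

print(len(blocks), "blocks, last one", blocks[-1], "MB")
print(readable_after_failure(placement, {"node2", "node3"}))
```

With three replicas per block, losing any two of the five hypothetical nodes still leaves one live copy of every block, which is the property the "fault-tolerant" bullet refers to.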
HDFS (diagrams)
HBase
HBase Database
● Key/Value data store
● Distributed, multi-dimensional sorted map.
● Modeled after Google BigTable
● Not an RDBMS; uses only a light schema
● Random updates to the data are possible, unlike in HDFS
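The "distributed, multi-dimensional sorted map" idea above can be sketched on a single machine in Python. This is a toy model, not the HBase API: the class `ToySortedTable`, its methods and the column name `info:name` are hypothetical. Cells are keyed by (row key, column, timestamp), and keeping the keys sorted is what makes row-range scans cheap.

```python
import bisect


class ToySortedTable:
    """Toy single-node model of a sorted map: (row, column, timestamp) -> value."""

    def __init__(self):
        self._keys = []      # sorted list of (row, column, -timestamp)
        self._values = {}    # key -> value

    def put(self, row, column, timestamp, value):
        # Negating the timestamp makes the newest version sort first.
        key = (row, column, -timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value   # an existing cell version is updated in place

    def get(self, row, column):
        """Return the newest value for a cell, or None if it is absent."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._values[self._keys[i]]
        return None

    def scan(self, start_row, stop_row):
        """Yield (row, column, timestamp, value) for start_row <= row < stop_row."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < stop_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._values[self._keys[i]]
            i += 1


table = ToySortedTable()
table.put("user2", "info:name", 1, "Bob")
table.put("user1", "info:name", 1, "Alice")
table.put("user1", "info:name", 2, "Alicia")   # newer version of the same cell

print(table.get("user1", "info:name"))
print(list(table.scan("user1", "user2")))
```

The real system additionally splits the sorted key space into regions spread over many servers; that distribution is left out of this sketch.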
HBase Architecture
MapReduce
What is MapReduce ?
● Programming model for processing large-scale data in parallel
● Automatic parallelization and distribution
● Two-phase processing: Map phase and Reduce phase
● Coordinated by a JobTracker and TaskTrackers
● Handles machine failures, just as HDFS does
MapReduce
MapReduce Framework
Map Phase:
Extracts something you care about from each record; the framework
then shuffles and sorts the map output
Reduce Phase:
Gets its input from the Map phase, then aggregates, filters, transforms
and summarizes the results.
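The two phases described above can be sketched as an in-memory word count in Python. This is an illustration of the programming model only, not Hadoop code: the function names are hypothetical, and everything runs in one process instead of being distributed across a cluster.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Shuffle and sort: group all values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: aggregate all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Chain the phases: map every record, group by key, reduce each group."""
    mapped = (pair for record in records for pair in map_phase(record))
    return [reduce_phase(key, values) for key, values in shuffle_and_sort(mapped)]


lines = ["big data on Hadoop", "big files big queries"]
print(run_job(lines))
```

Because each call to `map_phase` looks at one record and each call to `reduce_phase` looks at one key, both can be spread over many machines, which is where the automatic parallelization comes from.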
MapReduce Architecture (diagrams)
YARN Framework
What is YARN ?
● Yet Another Resource Negotiator
● Next-generation MapReduce framework (MapReduce 2)
● No JobTracker to control the TaskTrackers
● Each job controls its own destiny through an ApplicationMaster,
which takes care of the execution flow: scheduling tasks,
handling speculative execution, failures, etc.
Hive
What is Hive ?
Developed by Facebook engineers and donated to Apache.
Apache Hive is a data warehouse infrastructure built on top of
Hadoop that provides data summarization, query, and analysis.
It can operate on compressed data stored in the Hadoop ecosystem.
Hive
● Query language for data in HDFS and HBase
● Provides an SQL-like language called HiveQL
● Automatically converts Hive queries into MapReduce jobs
● Accelerates queries by providing indexes
● Stores metadata in an RDBMS, significantly reducing the
time to perform semantic checks during query execution
● Facebook has the biggest Hive implementation
Apache Pig
● Developed at Yahoo, Pig is a scripting-based query
platform for HDFS and HBase
● The language for this platform is called Pig Latin
● Automatically converts Pig Latin scripts into MapReduce
jobs: an ad-hoc way of creating and executing MapReduce
jobs
● Differences between Pig and SQL include Pig's use of
lazy evaluation, its ability to store data at any point during a
pipeline, and its explicit declaration of execution plans
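The lazy-evaluation point above can be illustrated in Python. This is a toy model, not Pig: the class `LazyRelation` is hypothetical, and only the step names (LOAD, FOREACH, FILTER) are borrowed from Pig Latin. Each operation is recorded in a plan, and nothing runs until a result is explicitly requested.

```python
class LazyRelation:
    """Toy relation: operations are recorded, not executed, until dump()."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def filter(self, predicate):
        return LazyRelation(self._source, self._steps + (("FILTER", predicate),))

    def foreach(self, func):
        return LazyRelation(self._source, self._steps + (("FOREACH", func),))

    def explain(self):
        """Show the recorded plan without running it."""
        return ["LOAD"] + [name for name, _ in self._steps]

    def dump(self):
        """Execute the whole recorded pipeline and return the result."""
        rows = iter(self._source)
        for name, fn in self._steps:
            rows = filter(fn, rows) if name == "FILTER" else map(fn, rows)
        return list(rows)


calls = []   # records when parsing work actually happens


def parse(line):
    calls.append(line)
    user, amount = line.split(",")
    return user, int(amount)


raw = LazyRelation(["ann,5", "bob,12", "cat,30"])
parsed = raw.foreach(parse)
large = parsed.filter(lambda row: row[1] > 10)

print(large.explain())
print(len(calls))       # still zero: defining the pipeline ran nothing
print(large.dump())
print(parsed.dump())    # an intermediate relation can be materialized too
```

Deferring execution lets the whole plan be inspected and optimized before any data is read, and any intermediate relation can still be written out on its own.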
Apache Flume
● Hadoop can store and process weblogs, network
logs and sensor log data.
● But how does data stored on many different servers
get to the Hadoop cluster ?
Apache Flume comes to the rescue
Apache Flume
● Flume is a distributed data collection service that takes
flows of data from their sources and aggregates them
where they have to be processed.
● Goals include reliability, scalability and extensibility.
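The collect-and-aggregate flow above can be sketched in Python. Flume agents are built from sources, channels and sinks; the sketch below borrows those three terms but is not Flume code. The function names, server names and the in-memory list standing in for a file in HDFS are all hypothetical.

```python
import queue


def source(server_name, log_lines, channel):
    """Source: turn each log line from one server into an event on the channel."""
    for line in log_lines:
        channel.put({"host": server_name, "body": line})


def sink(channel, destination):
    """Sink: drain buffered events from the channel into one destination."""
    delivered = 0
    while True:
        try:
            event = channel.get_nowait()
        except queue.Empty:
            return delivered
        destination.append(f"{event['host']} {event['body']}")
        delivered += 1


channel = queue.Queue()   # buffers events between sources and sink
hadoop_store = []         # hypothetical stand-in for a file in HDFS

source("web1", ["GET /index", "GET /cart"], channel)
source("web2", ["POST /login"], channel)

print(sink(channel, hadoop_store), "events delivered")
print(hadoop_store)
```

The channel decouples producers from the consumer: sources can keep writing while the sink is slow or briefly unavailable, which is one way such a design supports the reliability goal.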
Integration to SAP Business Objects
● Business Objects v4.0 supports Apache Hadoop and Hive
● Business Objects accesses Hadoop using Hive as a data
source.
● It uses a JDBC driver to connect to Hive on Hadoop.
http://events.asug.com/2012BOUC/1210_SAP_BusinessObjects_BI_4_0_FP3_on_Apache_Hadoop_Hive.pdf
Integration to IBM Cognos
● IBM offers Hadoop support in a product named IBM
InfoSphere BigInsights
● It adds a web-based analytical tool called BigSheets
● InfoSphere BigInsights has full integration with the Cognos
reporting tool
http://www-304.ibm.com/easyaccess/fileserve?contentid=217007
Future of Hadoop
● Big Data is here to stay, and companies are going to lose in
a big way if they don't seize the data science opportunity.
● We might see a new enterprise role called Data Scientist
● Apache Hadoop is a cutting-edge data technology, and all the
current frameworks & tools will change drastically.
Batch to Real-time
● Problem with Hadoop
○ Hadoop jobs are batch processes by nature and have high
latency.
● Google Dremel
○ Google released another paper, on its Dremel project,
which describes real-time processing of Big Data.
○ The open source community started Apache Drill, which
will bring Dremel-like real-time processing to the
Hadoop ecosystem.
References
Yahoo tutorial - http://developer.yahoo.com/hadoop/tutorial/
Apache Hadoop - tutorial
Thank you
Thanks!

Weitere ähnliche Inhalte

Andere mochten auch

Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & HadoopEdureka!
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduceFrane Bandov
 
Monkey runner & Monkey testing
Monkey runner & Monkey testingMonkey runner & Monkey testing
Monkey runner & Monkey testingSWAAM Tech
 
Introduction for skills seminar on Search and Data Mining, Master of European...
Introduction for skills seminar on Search and Data Mining, Master of European...Introduction for skills seminar on Search and Data Mining, Master of European...
Introduction for skills seminar on Search and Data Mining, Master of European...Gerben Zaagsma
 
Touch Screen Based Home Automation System
Touch Screen Based Home Automation SystemTouch Screen Based Home Automation System
Touch Screen Based Home Automation SystemEdgefxkits & Solutions
 
Data mining with big data
Data mining with big dataData mining with big data
Data mining with big datakk1718
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY pptsravya raju
 
Data minig with Big data analysis
Data minig with Big data analysisData minig with Big data analysis
Data minig with Big data analysisPoonam Kshirsagar
 
MonkeyTalk Automation Testing For Android Application
MonkeyTalk Automation Testing For Android ApplicationMonkeyTalk Automation Testing For Android Application
MonkeyTalk Automation Testing For Android ApplicationContusQA
 
automation in construction
automation in constructionautomation in construction
automation in constructionAnand Khare
 
Robots & Automation
Robots & AutomationRobots & Automation
Robots & Automationcemal
 

Andere mochten auch (17)

Monkey talk
Monkey talkMonkey talk
Monkey talk
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
 
Monkey runner & Monkey testing
Monkey runner & Monkey testingMonkey runner & Monkey testing
Monkey runner & Monkey testing
 
HMI
HMIHMI
HMI
 
Introduction for skills seminar on Search and Data Mining, Master of European...
Introduction for skills seminar on Search and Data Mining, Master of European...Introduction for skills seminar on Search and Data Mining, Master of European...
Introduction for skills seminar on Search and Data Mining, Master of European...
 
Human machine interface
Human machine interfaceHuman machine interface
Human machine interface
 
Touch Screen Based Home Automation System
Touch Screen Based Home Automation SystemTouch Screen Based Home Automation System
Touch Screen Based Home Automation System
 
Data mining with big data
Data mining with big dataData mining with big data
Data mining with big data
 
Data mining with big data
Data mining with big dataData mining with big data
Data mining with big data
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
Data minig with Big data analysis
Data minig with Big data analysisData minig with Big data analysis
Data minig with Big data analysis
 
MonkeyTalk Automation Testing For Android Application
MonkeyTalk Automation Testing For Android ApplicationMonkeyTalk Automation Testing For Android Application
MonkeyTalk Automation Testing For Android Application
 
automation in construction
automation in constructionautomation in construction
automation in construction
 
Scada & hmi
Scada & hmiScada & hmi
Scada & hmi
 
Big data mining
Big data miningBig data mining
Big data mining
 
Robots & Automation
Robots & AutomationRobots & Automation
Robots & Automation
 

Mehr von Naresh Chintalcheru

Bimodal IT for Speed and Innovation
Bimodal IT for Speed and InnovationBimodal IT for Speed and Innovation
Bimodal IT for Speed and InnovationNaresh Chintalcheru
 
Introduction to Node.js Platform
Introduction to Node.js PlatformIntroduction to Node.js Platform
Introduction to Node.js PlatformNaresh Chintalcheru
 
3rd Generation Web Application Platforms
3rd Generation Web Application Platforms3rd Generation Web Application Platforms
3rd Generation Web Application PlatformsNaresh Chintalcheru
 
Asynchronous Processing in Java/JEE/Spring
Asynchronous Processing in Java/JEE/SpringAsynchronous Processing in Java/JEE/Spring
Asynchronous Processing in Java/JEE/SpringNaresh Chintalcheru
 
Problems opening SOA to the Online Web Applications
Problems opening SOA to the Online Web ApplicationsProblems opening SOA to the Online Web Applications
Problems opening SOA to the Online Web ApplicationsNaresh Chintalcheru
 
Lie Cheat & Steal to build Hyper-Fast Applications using Event-Driven Archite...
Lie Cheat & Steal to build Hyper-Fast Applications using Event-Driven Archite...Lie Cheat & Steal to build Hyper-Fast Applications using Event-Driven Archite...
Lie Cheat & Steal to build Hyper-Fast Applications using Event-Driven Archite...Naresh Chintalcheru
 
Java7 New Features and Code Examples
Java7 New Features and Code ExamplesJava7 New Features and Code Examples
Java7 New Features and Code ExamplesNaresh Chintalcheru
 
Design & Develop Batch Applications in Java/JEE
Design & Develop Batch Applications in Java/JEEDesign & Develop Batch Applications in Java/JEE
Design & Develop Batch Applications in Java/JEENaresh Chintalcheru
 
Building Next Generation Real-Time Web Applications using Websockets
Building Next Generation Real-Time Web Applications using WebsocketsBuilding Next Generation Real-Time Web Applications using Websockets
Building Next Generation Real-Time Web Applications using WebsocketsNaresh Chintalcheru
 
Automation Testing using Selenium
Automation Testing using SeleniumAutomation Testing using Selenium
Automation Testing using SeleniumNaresh Chintalcheru
 
Design & Development of Web Applications using SpringMVC
Design & Development of Web Applications using SpringMVC Design & Development of Web Applications using SpringMVC
Design & Development of Web Applications using SpringMVC Naresh Chintalcheru
 
Object-Oriented Polymorphism Unleashed
Object-Oriented Polymorphism UnleashedObject-Oriented Polymorphism Unleashed
Object-Oriented Polymorphism UnleashedNaresh Chintalcheru
 

Mehr von Naresh Chintalcheru (17)

Cars.com Journey to AWS Cloud
Cars.com Journey to AWS CloudCars.com Journey to AWS Cloud
Cars.com Journey to AWS Cloud
 
Bimodal IT for Speed and Innovation
Bimodal IT for Speed and InnovationBimodal IT for Speed and Innovation
Bimodal IT for Speed and Innovation
 
Reactive systems
Reactive systemsReactive systems
Reactive systems
 
Introduction to Node.js Platform
Introduction to Node.js PlatformIntroduction to Node.js Platform
Introduction to Node.js Platform
 
3rd Generation Web Application Platforms
3rd Generation Web Application Platforms3rd Generation Web Application Platforms
3rd Generation Web Application Platforms
 
Asynchronous Processing in Java/JEE/Spring
Asynchronous Processing in Java/JEE/SpringAsynchronous Processing in Java/JEE/Spring
Asynchronous Processing in Java/JEE/Spring
 
Problems opening SOA to the Online Web Applications
Problems opening SOA to the Online Web ApplicationsProblems opening SOA to the Online Web Applications
Problems opening SOA to the Online Web Applications
 
Lie Cheat & Steal to build Hyper-Fast Applications using Event-Driven Archite...
Lie Cheat & Steal to build Hyper-Fast Applications using Event-Driven Archite...Lie Cheat & Steal to build Hyper-Fast Applications using Event-Driven Archite...
Lie Cheat & Steal to build Hyper-Fast Applications using Event-Driven Archite...
 
Java7 New Features and Code Examples
Java7 New Features and Code ExamplesJava7 New Features and Code Examples
Java7 New Features and Code Examples
 
Big Trends in Big Data
Big Trends in Big DataBig Trends in Big Data
Big Trends in Big Data
 
Design & Develop Batch Applications in Java/JEE
Design & Develop Batch Applications in Java/JEEDesign & Develop Batch Applications in Java/JEE
Design & Develop Batch Applications in Java/JEE
 
Building Next Generation Real-Time Web Applications using Websockets
Building Next Generation Real-Time Web Applications using WebsocketsBuilding Next Generation Real-Time Web Applications using Websockets
Building Next Generation Real-Time Web Applications using Websockets
 
Mule ESB Fundamentals
Mule ESB FundamentalsMule ESB Fundamentals
Mule ESB Fundamentals
 
Automation Testing using Selenium
Automation Testing using SeleniumAutomation Testing using Selenium
Automation Testing using Selenium
 
Design & Development of Web Applications using SpringMVC
Design & Development of Web Applications using SpringMVC Design & Development of Web Applications using SpringMVC
Design & Development of Web Applications using SpringMVC
 
Android Platform Architecture
Android Platform ArchitectureAndroid Platform Architecture
Android Platform Architecture
 
Object-Oriented Polymorphism Unleashed
Object-Oriented Polymorphism UnleashedObject-Oriented Polymorphism Unleashed
Object-Oriented Polymorphism Unleashed
 

Kürzlich hochgeladen

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Kürzlich hochgeladen (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Apache Hadoop - BigData Management

  • 1. Big Data Management on Apache Hadoop - Naresh Chintalcheru
  • 2. Agenda ■ What is Big Data ■ Big Inspiration: Google GFS & BigTable ■ Hadoop's Big Architecture ■ Big Files: Apache HDFS and HBase ■ Big Queries: Hive and Pig Latin ■ Big Pipes: Flume and Scoop ■ Big Frameworks: MapReduce and YARN ■ Big Integration: Hadoop & BI Tools (SAP Business Objects, IBM Cognos) ■ Future of Hadoop: Batch to Real-time
  • 3. What is Big Data ? Big Data is a collection of data sets so large and complex that it becomes difficult to process using traditional database management tools. -Wikipedia
  • 4. What is Big Data ? Big Data is a collection of data sets so large and complex that it becomes difficult to process using traditional database management tools. ● Large data sets in terms of terabytes and petabytes ● Complex with different data types and formats ● Difficult to process with traditional database tools and involve expensive & proprietary solutions
  • 5. What is Big Data ? Big Data is all about the size ?
  • 6. Big Data V-V-V-V Big data is explained using 4 V's ● Volume ● Velocity ● Variety ● Variability
  • 7. Big Volume Data usage over the years .... ● 3 1/2 inch Floppy Disk max capacity 1.44MB ● CD max capacity 700MB (Music) ● DVD capacity range 10GB (Movies) ● Blu-Ray Disc 25GB (HD, 3D Movies) ● iPod Classic 160GB ● 3TB hard drive for $130 amazon.com
  • 8. Big Volume Imagine your own personal life ... ● A couple of decades ago: postal mail from friends, household bills and printed family pictures ● The majority of communications have been replaced by Facebook messages, Tweets, SMS texts and emails (fading away) ● Upload pictures to Facebook, Flickr or Picasa ● How many bills do you pay online? You can look up online how much you paid for the same service last year
  • 9. Big Velocity Exponential growth of Corporate & Personal Data ● Personal data ○ More music, more movies and more online transactions ● Facebook processed (infoq.com) ○ 2 PB of data in 2009 ○ 20 PB of data in 2010 ○ 60 PB of data in 2011 ○ 100 PB of data in 2012 ● Every Sixty Seconds ... (dzone.com) ○ 694,445 Google searches ○ 6,600+ pictures uploaded to Flickr ○ 98,000 tweets ○ 600 videos uploaded to YouTube ○ 13,000 iPhone apps downloaded
  • 10. Big Variety Flavors of data can be just as shocking because combinations of relational data, unstructured data such as text, images, video, and every other variation can cause complexity in storing, processing, and querying the data. Traditional Data Big Data Text Data Emails, Documents Pictures, images Stock records Audio, Video Finances 3D Models Personal files Location Sensor data
  • 11. Big Variability Data continuously changing ... ● It took years for traditional RDBMS to add an XML column ● Still no JSON column type in RDBMS ● Many more new formats to come Dealing with variability in traditional databases is a very slow process
  • 12. Problem with RDBMS ● RDBMS or traditional database deals with Structured Data ● 20% of corporate data is Structured and 80% is Unstructured ● Predefined database Schema and Data type makes it harder to adapt to new data formats ● RDBMS horizontal scaling is complex and expensive
  • 13. Power of Big Data Big Data ● Deals with unstructured data ● Built on horizontal scaling architecture
  • 14. Big Data Sources Data collected from ... Weblogs, Social Network Video archives, Photography archives Mobile Phone data, Sensors RFID barcodes Medical records Atmospheric Science Personal Finance Camera surveillance e-commerce and m-commerce transactions
  • 15. Big Data Benefits ● Create new revenue streams for companies: the insights that you gain from analyzing your market and its consumers with Big Data. ● Perform effective risk analysis: predictive analytics, fueled by Big Data, allows you to scan and analyze newspaper reports or social media feeds so that you keep up to speed on the latest developments in your industry. ● Re-design products: Big Data can also help you understand how others perceive your products so that you can adapt them, or your marketing. ● Social Intelligence: the emergence of Social Intelligence, similar to Business Intelligence, from social network websites. ● Security benefits: web logs are saved and analysed for unusual access behaviours.
  • 16. Agenda ■ What is Big Data ■ Big Inspiration: Google GFS & BigTable ■ Hadoop's Big Architecture ■ Big Files: Apache HDFS and HBase ■ Big Queries: Hive and Pig Latin ■ Big Pipes: Flume and Sqoop ■ Big Frameworks: MapReduce and YARN ■ Big Integration: Hadoop & BI Tools (Business Objects, Cognos) ■ Future of Hadoop: Batch to Real-time
  • 17. Big Inspiration Google released a series of papers on the technology behind its Search product. ● Google released the first paper, on the distributed file system GFS, in 2003. ● Released the second paper, about the MapReduce framework, in 2004. ● Released the next paper, on BigTable, in 2006.
  • 18. Big Inspiration Inspired by the Google papers .... Doug Cutting, a Yahoo employee at the time, saw the opportunity and led the charge in developing an open source version of GFS & Google MapReduce. He named it after his kid's toy, Hadoop.
  • 19. Big Inspiration Google Products → Apache Hadoop Products ● GFS: Google File System → HDFS: Hadoop Distributed File System ● GMR: Google MapReduce → MapReduce ● BigTable → HBase ● Google Dremel → Apache Drill
  • 20. Agenda ■ What is Big Data ■ Big Inspiration: Google GFS & BigTable ■ Hadoop's Big Architecture ■ Big Files: Apache HDFS and HBase ■ Big Queries: Hive and Pig Latin ■ Big Pipes: Flume and Sqoop ■ Big Frameworks: MapReduce and YARN ■ Big Integration: Hadoop & BI Tools (Business Objects, Cognos) ■ Future of Hadoop: Batch to Real-time
  • 22. Hadoop Architecture Unlike traditional databases Hadoop divides Data Processing and Data Storage into different nodes.
  • 24. Hadoop Architecture What is Hadoop ? A scalable fault-tolerant grid operating system for data storage and processing. -Cloudera
  • 25. Agenda ■ What is Big Data ■ Big Inspiration: Google GFS & BigTable ■ Hadoop's Big Architecture ■ Big Files: Apache HDFS and HBase ■ Big Queries: Hive and Pig Latin ■ Big Pipes: Flume and Sqoop ■ Big Frameworks: MapReduce and YARN ■ Big Integration: Hadoop & BI Tools (Business Objects, Cognos) ■ Future of Hadoop: Batch to Real-time
  • 26. HDFS HDFS: Hadoop Distributed File System ● Self-healing, high-bandwidth clustered storage. ● Streams very large files on commodity servers. ● Stores data as files. ● Divides a single file into multiple blocks ● Fault-tolerant to hardware failures
  • 27. HDFS
  • 28. HDFS
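As a rough illustration of "divides a single file into multiple blocks" and of tolerating hardware failures through copies, the sketch below splits a file size into fixed-size blocks and assigns each block to several nodes. The 128 MB block size, the replication factor of 3, the node names and the round-robin placement are all assumptions made for this example; they do not come from the slides, and real HDFS placement is more involved.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is a hypothetical policy chosen for simplicity.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# A 300 MB file with an assumed 128 MB block size yields three blocks,
# the last one only partially filled.
blocks = split_into_blocks(300 * MB, 128 * MB)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], 3)
```

Because every block lives on three of the four hypothetical nodes, losing any single node still leaves two copies of each block available.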
  • 29. HBase HBase Database ● Key/Value data store ● Distributed, multi-dimensional sorted map. ● Modeled after Google BigTable ● Not an RDBMS; light schema ● Random updates to the data are possible, unlike HDFS.
  • 31. Agenda ■ What is Big Data ■ Big Inspiration: Google GFS & BigTable ■ Hadoop's Big Architecture ■ Big Files: Apache HDFS and HBase ■ Big Frameworks: MapReduce and YARN ■ Big Queries: Hive and Pig Latin ■ Big Pipes: Flume and Sqoop ■ Big Integration: Hadoop & BI Tools (Business Objects, Cognos) ■ Future of Hadoop: Batch to Real-time
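The phrase "multi-dimensional sorted map" can be pictured with a small in-memory sketch: values are keyed by row, column and timestamp, scans walk the keys in sorted order, and a new put updates a cell in place. The class name `TinySortedTable`, its methods and the sample keys are hypothetical and exist only for this illustration; this is not the HBase client API, and details such as HBase's own version ordering are left out.

```python
class TinySortedTable:
    """Toy map keyed by (row, column, timestamp); a sketch, not HBase."""

    def __init__(self):
        self._cells = {}  # (row, column, timestamp) -> value

    def put(self, row, column, value, timestamp):
        """Store a value; a later timestamp acts as a newer version."""
        self._cells[(row, column, timestamp)] = value

    def get(self, row, column):
        """Return the newest value for (row, column), or None if absent."""
        versions = [
            (ts, value)
            for (r, c, ts), value in self._cells.items()
            if r == row and c == column
        ]
        if not versions:
            return None
        return max(versions, key=lambda pair: pair[0])[1]

    def scan(self, start_row, stop_row):
        """Yield cells with start_row <= row < stop_row in sorted key order."""
        for key in sorted(self._cells):
            row, column, ts = key
            if start_row <= row < stop_row:
                yield row, column, ts, self._cells[key]


table = TinySortedTable()
table.put("user1", "info:name", "Ann", 1)
table.put("user2", "info:name", "Bob", 1)
table.put("user1", "info:name", "Anna", 2)  # random update of an existing cell

latest = table.get("user1", "info:name")
user1_cells = list(table.scan("user1", "user2"))
```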
  • 32. MapReduce What is MapReduce ? ● Programming model to process large-scale data in parallel ● Automatic parallelization and distribution ● Two-phase processing: Map phase & Reduce phase ● Job Tracker and Task Tracker ● Handles machine failures just like HDFS
  • 33. MapReduce MapReduce Framework Map Phase: Extracts something you care about from each record, then shuffles and sorts the records. Reduce Phase: Gets input from the Map Phase, then aggregates, filters, transforms and summarizes the results.
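The two phases can be sketched in plain Python with a word count, a common teaching example that is not taken from the slides. Everything here runs in a single process; the function names and sample records are assumptions for illustration, and a real Hadoop job would distribute the map and reduce work across machines.

```python
from collections import defaultdict


def map_phase(records):
    """Extract what we care about from each record: emit (word, 1) pairs."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle_and_sort(pairs):
    """Group values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Aggregate the values of each key into a single summary value."""
    return {key: sum(values) for key, values in grouped}


records = ["big data big files", "big queries"]
counts = reduce_phase(shuffle_and_sort(map_phase(records)))
```

Each function only depends on its own input, which is what makes it possible for a framework to run many map and reduce tasks in parallel.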
  • 36. YARN Framework What is YARN ? ● Yet Another Resource Negotiator ● Next-generation MapReduce framework ● No Job Tracker to control the Task Trackers ● Each job controls its own destiny using an Application Master, which takes care of execution flow such as scheduling tasks, handling speculative execution and failures, etc.
  • 37. Agenda ■ What is Big Data ■ Big Inspiration: Google GFS & BigTable ■ Hadoop's Big Architecture ■ Big Files: Apache HDFS and HBase ■ Big Frameworks: MapReduce and YARN ■ Big Queries: Hive and Pig Latin ■ Big Pipes: Flume and Sqoop ■ Big Integration: Hadoop & BI Tools (Business Objects, Cognos) ■ Future of Hadoop: Batch to Real-time
  • 38. Hive What is Hive ? Developed by Facebook engineers and donated to Apache. Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Operates on compressed data stored in the Hadoop ecosystem.
  • 39. Hive ● Query language for HDFS and HBase ● Provides an SQL-like language called HiveQL ● Automatic conversion of Hive queries to MapReduce jobs ● Accelerates queries by providing indexes ● Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution ● Facebook has the biggest Hive implementation
  • 40. Apache Pig ● Developed by Yahoo ● Pig is a scripting-based query language for HDFS and HBase ● The language for this platform is called Pig Latin ● Automatic conversion of Pig Latin scripts to MapReduce jobs. An ad-hoc way of creating and executing MapReduce jobs ● Differences between Pig and SQL include Pig's use of lazy evaluation, the ability to store data at any point during a pipeline, and explicit declaration of execution plans
  • 41. Agenda ■ What is Big Data ■ Big Inspiration: Google GFS & BigTable ■ Hadoop's Big Architecture ■ Big Files: Apache HDFS and HBase ■ Big Frameworks: MapReduce and YARN ■ Big Queries: Hive and Pig Latin ■ Big Pipes: Flume and Sqoop ■ Big Integration: Hadoop & BI Tools (Business Objects, Cognos) ■ Future of Hadoop: Batch to Real-time
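Lazy evaluation, mentioned above as one difference between Pig and SQL, means that describing a pipeline does no work until a result is requested. The sketch below imitates that idea with Python generators. It is only an analogy under assumed names (`load`, `filter_rows`, the `log` list); it is not Pig Latin and does not show how Pig builds its execution plans.

```python
def load(rows, log):
    """Yield rows one at a time, recording each read in `log`."""
    for row in rows:
        log.append(("load", row))
        yield row


def filter_rows(rows, predicate):
    """Describe a filter step; nothing is evaluated until it is consumed."""
    return (row for row in rows if predicate(row))


log = []
pipeline = filter_rows(load([1, 2, 3, 4], log), lambda x: x % 2 == 0)

# The pipeline is only a description so far: no row has been read yet.
reads_before_output = len(log)

# Asking for the output is what triggers the work.
result = list(pipeline)
reads_after_output = len(log)
```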
  • 42. Apache Flume ● Hadoop can store and process all the weblogs, network logs and sensor log data. ● But how is data that is stored on different servers supplied to the Hadoop cluster ? Apache Flume comes to the rescue
  • 43. Apache Flume ● Flume is a distributed data collection service that takes flows of data from the source and aggregates them to where they have to be processed. ● Goals include reliability, scalability and extensibility.
  • 44. Agenda ■ What is Big Data ■ Big Inspiration: Google GFS & BigTable ■ Hadoop's Big Architecture ■ Big Files: Apache HDFS and HBase ■ Big Frameworks: MapReduce and YARN ■ Big Queries: Hive and Pig Latin ■ Big Pipes: Flume and Sqoop ■ Big Integration: Hadoop & BI Tools (Business Objects, Cognos) ■ Future of Hadoop: Batch to Real-time
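The collection flow described above (data gathered from several servers and aggregated in one place for processing) can be sketched with an in-memory queue. The function names, server names and log lines are hypothetical, and the queue merely stands in for whatever buffering a real collector uses; this is not the Flume API or its configuration format.

```python
from queue import Queue


def collect(sources, channel):
    """Read log lines from every server and push them into a shared channel."""
    for server, lines in sources.items():
        for line in lines:
            channel.put((server, line))


def drain(channel, sink):
    """Move everything buffered in the channel to a single destination."""
    while not channel.empty():
        sink.append(channel.get())


# Hypothetical web servers, each holding its own log lines.
sources = {"web1": ["GET /a", "GET /b"], "web2": ["GET /c"]}
channel = Queue()
sink = []  # stands in for the place where the data will be processed

collect(sources, channel)
drain(channel, sink)
```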
  • 45. Integration to SAP Business Objects ● Business Objects v4.0 supports Apache Hadoop and Hive ● Business Objects accesses Hadoop using Hive as a data source. ● Uses a JDBC driver to connect to Hadoop Hive. http://events.asug.com/2012BOUC/1210_SAP_BusinessObjects_BI_4_0_FP3_on_Apache_Hadoop_Hive.pdf
  • 46. Integration to IBM Cognos ● IBM offers support for Hadoop and named the product IBM InfoSphere BigInsights ● Added a web-based analytical tool called BigSheets ● InfoSphere BigInsights has full integration with the Cognos reporting tool http://www-304.ibm.com/easyaccess/fileserve?contentid=217007
  • 47. Agenda ■ What is Big Data ■ Big Inspiration: Google GFS & BigTable ■ Hadoop's Big Architecture ■ Big Files: Apache HDFS and HBase ■ Big Frameworks: MapReduce and YARN ■ Big Queries: Hive and Pig Latin ■ Big Pipes: Flume and Sqoop ■ Big Integration: Hadoop & BI Tools (Business Objects, Cognos) ■ Future of Hadoop: Batch to Real-time
  • 48. Future of Hadoop ● Big Data is here to stay, and companies are going to lose in a big way if they don't utilize the data science opportunity. ● We might see a new enterprise role called Data Scientist ● Apache Hadoop is a cutting-edge data technology, and all the current frameworks & tools will change drastically.
  • 49. Batch to Real-time ● Problem with Hadoop ○ Hadoop jobs are batch processes by nature, with high latency. ● Google Dremel ○ Google released another paper, on the Dremel project, which covers real-time processing of Big Data. ○ The open source community started Apache Drill, which will bring Dremel-like real-time processing to the Hadoop ecosystem.
  • 50. References Yahoo tutorial - http://developer.yahoo.com/hadoop/tutorial/ Apache Hadoop - tutorial