Apache Hadoop - BigData Management

Big Data Management
on
Apache Hadoop
- Naresh Chintalcheru

Agenda
■ What is Big Data
■ Big Inspiration: Google GFS & BigTable
■ Hadoop's Big Architecture
■ Big Files: Apache HDFS and HBase
■ Big Queries: Hive and Pig Latin
■ Big Pipes: Flume and Sqoop
■ Big Frameworks: MapReduce and YARN
■ Big Integration: Hadoop & BI Tools (SAP Business Objects, IBM Cognos)
■ Future of Hadoop: Batch to Real-time

What is Big Data ?
Big Data is a collection of data sets so large and complex that it
becomes difficult to process using traditional database
management tools.
-Wikipedia

What is Big Data ?
Big Data is a collection of data sets so large and complex that it becomes difficult to
process using traditional database management tools.
● Large data sets in terms of terabytes and petabytes
● Complex with different data types and formats
● Difficult to process with traditional database tools and involve expensive &
proprietary solutions

What is Big Data ?
Is Big Data all about size?

Big Data V-V-V-V
Big data is explained using 4 V's
● Volume
● Velocity
● Variety
● Variability

Big Volume
Data usage over the years ....
● 3 1/2 inch Floppy Disk max capacity 1.44MB
● CD max capacity 700MB (Music)
● DVD capacity 4.7GB single-layer, 8.5GB dual-layer (Movies)
● Blu-Ray Disc 25GB (HD, 3D Movies)
● iPod Classic 160GB
● 3TB hard drive for $130 on amazon.com

Big Volume
Imagine your own personal life ...
● A couple of decades ago: postal mail from friends, household bills and printed
family pictures
● Today most communication has been replaced by Facebook messages, Tweets, SMS
texts and emails (fading away)
● Pictures are uploaded to Facebook, Flickr or Picasa
● How many bills do you pay online? You can look up online how much you paid for
the same service last year

Big Velocity
Exponential growth of Corporate & Personal Data
● Personal data
○ More music, more movies and more online transactions
● Facebook processed (infoq.com)
○ 2 PB of data in 2009
○ 20PB of data in 2010
○ 60PB of data in 2011
○ 100 PB of data in 2012
● Every Sixty Seconds ... (dzone.com)
○ 694,445 Google Searches
○ 6,600+ pictures uploaded to Flickr
○ 98,000 tweets
○ 600 videos uploaded to YouTube
○ 13,000 iPhone Apps downloaded

Big Variety
Flavors of data can be just as shocking because combinations of relational data,
unstructured data such as text, images, video, and every other variation can cause
complexity in storing, processing, and querying the data.
Traditional Data                 Big Data
Text data (emails, documents)    Pictures, images
Stock records                    Audio, video
Finances                         3D models
Personal files                   Location sensor data

Big Variability
Data formats change continuously ...
● It took years for traditional RDBMS to add an XML column
● At the time of writing, most RDBMS products still had no JSON column type
● Many more new formats to come
Dealing with variability in traditional databases is a very
slow process

Problem with RDBMS
● An RDBMS, or traditional database, deals with structured data
● By a commonly quoted estimate, about 20% of corporate data is
structured and 80% is unstructured
● A predefined database schema and fixed data types make it
harder to adapt to new data formats
● RDBMS horizontal scaling is complex and expensive

Power of Big Data
Big Data
● Deals with unstructured data
● Built on horizontal scaling architecture

Big Data Sources
Data collected from ...
Weblogs, Social Network
Video archives, Photography archives
Mobile Phone data, Sensors
RFID barcodes
Medical records
Atmospheric Science
Personal Finance
Camera surveillance
e-commerce and m-commerce transactions

Big Data Benefits
Create new revenue streams for the companies
Insights gained from analyzing your market and its consumers with Big
Data can point to new sources of revenue.
Perform effective risk analysis
Predictive analytics, fueled by Big Data allows you to scan and analyze newspaper
reports or social media feeds so that you permanently keep up to speed on the latest
developments in your industry
Re-design Products
Big Data can also help you understand how others perceive your products so that you
can adapt them, or your marketing
Social Intelligence
Emergence of Social Intelligence similar to Business Intelligence from social network
websites
Security Benefits
Web logs are saved and analysed for unusual access behaviours

Big Inspiration
Google released a series of papers on the technology behind its
search product.
● Released the first paper, on the distributed file system GFS,
in 2003.
● Released the second paper, on the MapReduce framework, in
2004.
● Released the next paper, on BigTable, in 2006.

Big Inspiration
Inspired by the Google papers ....
Doug Cutting, a Yahoo employee at the time, saw the opportunity
and led the development of an open source version of GFS
and Google MapReduce. He named it Hadoop after his son's toy elephant.

Big Inspiration
Google Products             Apache Hadoop Products
GFS: Google File System     HDFS: Hadoop Distributed File System
GMR: Google MapReduce       MapReduce
BigTable                    HBase
Google Dremel               Apache Drill

Hadoop Architecture
Unlike traditional databases, Hadoop separates data storage (HDFS)
and data processing (MapReduce) into distinct layers, both
distributed across the nodes of the cluster.

Hadoop Architecture
What is Hadoop ?
A scalable fault-tolerant grid operating system for
data storage and processing.
-Cloudera

HDFS
HDFS: Hadoop Distributed File System
● Self-healing, high-bandwidth clustered storage
● Streams very large files on commodity servers
● Stores data as files
● Divides a single file into multiple blocks
● Fault-tolerant to hardware failures
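
The block splitting and fault tolerance listed above can be sketched in a few lines of Python. This is a toy illustration, not HDFS code: the 128 MB block size, the replication factor of 3, the node names and the round-robin placement are all assumptions made for the example.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    return [(offset, min(block_size, file_size - offset))
            for offset in range(0, file_size, block_size)]

def place_replicas(blocks, nodes, replication):
    """Assign every block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes); real placement policies are smarter.
    """
    return {index: [nodes[(index + r) % len(nodes)] for r in range(replication)]
            for index in range(len(blocks))}

def survives_failure(placement, failed_node):
    """True if every block still has a replica on some other node."""
    return all(any(node != failed_node for node in replicas)
               for replicas in placement.values())

# Hypothetical 300 MB file on a hypothetical four-node cluster.
nodes = ["node-a", "node-b", "node-c", "node-d"]
blocks = split_into_blocks(300 * MB, 128 * MB)
placement = place_replicas(blocks, nodes, replication=3)
```

Because each block lives on several nodes, losing any single node in this sketch leaves every block readable, which is the idea behind "self-healing" storage.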

HBase
HBase Database
● Key/Value data store
● Distributed, multi-dimensional sorted map.
● Modeled after Google BigTable
● Not an RDBMS; uses a light schema
● Random updates to the data are possible, unlike in HDFS
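
A "distributed, multi-dimensional sorted map" is easier to picture with a small single-machine model. The sketch below is an assumption-laden toy, not the HBase API: the class name, method names, row keys and column names are all hypothetical. It maps (row key, column, timestamp) to a value and keeps row keys sorted so that range scans are cheap.

```python
import bisect

class SortedMapTable:
    """Toy table: (row key, column, timestamp) -> value, rows kept sorted."""

    def __init__(self):
        self._row_keys = []  # sorted list of row keys
        self._cells = {}     # row key -> {column -> {timestamp -> value}}

    def put(self, row, column, value, timestamp):
        """Random write: add or update one cell version."""
        if row not in self._cells:
            bisect.insort(self._row_keys, row)
            self._cells[row] = {}
        self._cells[row].setdefault(column, {})[timestamp] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None if absent."""
        versions = self._cells.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        """Return row keys in [start_row, stop_row) in sorted order."""
        lo = bisect.bisect_left(self._row_keys, start_row)
        hi = bisect.bisect_left(self._row_keys, stop_row)
        return self._row_keys[lo:hi]

table = SortedMapTable()
table.put("user#002", "info:city", "Chicago", timestamp=1)
table.put("user#001", "info:city", "Austin", timestamp=1)
table.put("user#001", "info:city", "Boston", timestamp=2)  # random update
table.put("user#010", "stats:logins", 7, timestamp=1)
```

Rows need no shared set of columns here, which is what "light schema" means in practice.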

MapReduce
What is MapReduce ?
● Programming model to process large-scale data in parallel
● Automatic parallelization and distribution
● Two-phase processing: Map phase and Reduce phase
● Job Tracker and Task Tracker
● Handles machine failures just like HDFS

MapReduce
MapReduce Framework
Map Phase:
Extracts something you care about from each record; the framework
then shuffles and sorts the records
Reduce Phase:
Gets input from the Map Phase, then aggregates, filters, transforms
and summarizes the results.
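
The two phases can be sketched on a single machine with the classic word count. This is a minimal Python illustration of the programming model only, not the Hadoop API; the function names and sample records are made up for the example, and a real job would run the map and reduce calls in parallel across the cluster.

```python
from collections import defaultdict

def map_phase(record):
    """Extract what we care about from one record: a (word, 1) pair per word."""
    for word in record.lower().split():
        yield word, 1

def shuffle_and_sort(pairs):
    """Group values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Summarize all values seen for one key."""
    return key, sum(values)

def run_job(records):
    mapped = (pair for record in records for pair in map_phase(record))
    return [reduce_phase(key, values)
            for key, values in shuffle_and_sort(mapped)]

records = ["big data big files", "big queries"]
result = run_job(records)
```

Here `result` holds one (word, count) pair per distinct word, in sorted key order.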

YARN Framework
What is YARN ?
● Yet Another Resource Negotiator
● Next generation MapReduce framework
● No Job Tracker to control the Task Trackers
● Each job controls its own destiny through an Application Master
that takes care of execution flow: scheduling tasks,
handling speculative execution and failures, etc.

Hive
What is Hive ?
Developed by Facebook engineers and donated to Apache.
Apache Hive is a data warehouse infrastructure built on top of
Hadoop for providing data summarization, query, and analysis.
Can operate on compressed data stored in the Hadoop ecosystem.

Hive
● Query language for HDFS and HBase
● Provides a SQL-like language called HiveQL
● Automatic conversion of Hive queries to MapReduce jobs
● Accelerates queries by providing indexes
● Metadata storage in an RDBMS, significantly reducing the
time to perform semantic checks during query execution
● Facebook has the biggest Hive implementation
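
To see why a SQL-like query can be converted to a MapReduce job, consider a simple aggregate. The Python sketch below is conceptual only and does not reflect Hive's actual query planner: the `weblogs` table, its columns and the function names are hypothetical. The WHERE clause becomes a filter in the map step, the GROUP BY column becomes the key, and SUM becomes the reduce step.

```python
from collections import defaultdict

# Hypothetical HiveQL query that the functions below imitate.
QUERY = "SELECT page, SUM(bytes) FROM weblogs WHERE status = 200 GROUP BY page"

def map_row(row):
    """WHERE filters rows; the GROUP BY column is emitted as the key."""
    if row["status"] == 200:
        yield row["page"], row["bytes"]

def reduce_group(page, byte_counts):
    """SUM(bytes) for one group."""
    return page, sum(byte_counts)

def run_query(rows):
    groups = defaultdict(list)
    for row in rows:
        for key, value in map_row(row):
            groups[key].append(value)
    return dict(reduce_group(key, values)
                for key, values in sorted(groups.items()))

weblogs = [
    {"page": "/home", "status": 200, "bytes": 512},
    {"page": "/home", "status": 200, "bytes": 256},
    {"page": "/cart", "status": 500, "bytes": 64},
    {"page": "/cart", "status": 200, "bytes": 128},
]
totals = run_query(weblogs)
```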

Apache Pig
● Developed by Yahoo, Pig is a scripting-based query
language for HDFS and HBase
● The language for this platform is called Pig Latin
● Automatic conversion of Pig Latin scripts to MapReduce
jobs: an ad-hoc way of creating and executing MapReduce
jobs
● Differences between Pig and SQL include Pig's use of
lazy evaluation, the ability to store data at any point during a
pipeline, and explicit declaration of execution plans
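
Lazy evaluation and the ability to store data at any point in a pipeline can be illustrated with a small Python model. This is not Pig Latin and not how Pig is implemented; the class and its methods are hypothetical, with step names loosely borrowed from Pig's FILTER and FOREACH operators. Adding a step only records it in the plan, and nothing runs until `store()` is called.

```python
class LazyPipeline:
    """Toy pipeline: steps are recorded first and executed only on store()."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def filter(self, predicate):
        return LazyPipeline(self._source, self._steps + (("FILTER", predicate),))

    def foreach(self, transform):
        return LazyPipeline(self._source, self._steps + (("FOREACH", transform),))

    def explain(self):
        """Show the declared execution plan without running it."""
        return [name for name, _ in self._steps]

    def store(self):
        """Run every recorded step over the source and return the rows."""
        rows = iter(self._source)
        for name, func in self._steps:
            rows = filter(func, rows) if name == "FILTER" else map(func, rows)
        return list(rows)

calls = []  # records when the predicate actually runs

def is_large(n):
    calls.append(n)
    return n > 10

filtered = LazyPipeline([5, 20, 30]).filter(is_large)
doubled = filtered.foreach(lambda n: n * 2)
calls_before_store = list(calls)   # still empty: nothing has executed yet
intermediate = filtered.store()    # store data in the middle of the pipeline
final = doubled.store()
```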

Apache Flume
● Hadoop can store and process all the weblogs, network
logs and sensor log data.
● But how does data stored on many different servers
get to the Hadoop cluster?
Apache Flume comes to the rescue

Apache Flume
● Flume is a distributed data collection service that takes
flows of data from their sources and aggregates them
where they have to be processed.
● Goals include reliability, scalability and extensibility.
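
The collect-and-aggregate idea can be sketched as follows. This Python toy is not Flume's API or configuration format; the `Agent` class, the `drain` helper, the host names and the in-memory buffer are assumptions for illustration. Collectors on different servers push events into a shared buffer, and the events are then delivered in batches to a single destination standing in for the Hadoop cluster.

```python
from collections import deque

class Agent:
    """Hypothetical collector running next to one log-producing server."""

    def __init__(self, host, channel):
        self.host = host
        self.channel = channel

    def collect(self, line):
        """Tag a log line with its origin and buffer it for delivery."""
        self.channel.append({"host": self.host, "body": line})

def drain(channel, sink, batch_size):
    """Move up to batch_size buffered events to the sink; return the count."""
    moved = 0
    while channel and moved < batch_size:
        sink.append(channel.popleft())
        moved += 1
    return moved

channel = deque()
web = Agent("web-01", channel)
app = Agent("app-01", channel)
web.collect("GET /home 200")
app.collect("order created")
web.collect("GET /cart 500")

cluster_sink = []  # stands in for the Hadoop cluster
first_batch = drain(channel, cluster_sink, batch_size=2)
second_batch = drain(channel, cluster_sink, batch_size=2)
```

Buffering between collection and delivery is one simple way to keep sources running when the destination is briefly slow, which relates to the reliability goal above.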

Integration to SAP Business Objects
● Business Objects v4.0 supports Apache Hadoop and Hive
● Business Objects accesses Hadoop using Hive as a data
source.
● Uses a JDBC driver to connect to Hive on Hadoop.
http://events.asug.com/2012BOUC/1210_SAP_BusinessObjects_BI_4_0_FP3_on_Apache_Hadoop_Hive.pdf

Integration to IBM Cognos
● IBM offers Hadoop support in a product named IBM
InfoSphere BigInsights
● Added a web-based analytical tool called BigSheets
● InfoSphere BigInsights has full integration with the Cognos
reporting tool
http://www-304.ibm.com/easyaccess/fileserve?contentid=217007

Future of Hadoop
● Big Data is here to stay, and companies are going to lose in
a big way if they don't use the data science opportunity.
● We might see a new enterprise role called Data Scientist
● Apache Hadoop is a cutting-edge data technology, and all the
current frameworks and tools will change drastically.

Batch to Real-time
● Problem with Hadoop
○ Hadoop jobs are batch processes by nature and have high
latency.
● Google Dremel
○ Google released another paper, on the Dremel project,
which covers real-time processing of Big Data.
○ The open source community started Apache Drill, which
aims to bring Dremel-like real-time processing to the
Hadoop ecosystem.

