1. Introduction to Big Data and Hadoop
- Presenting and defining big data
- Introducing Hadoop and its history
- Hadoop: how does it work?
- HDFS
2. Hadoop is ...
A scalable, fault-tolerant, distributed system for data storage and
processing (open source, under the Apache license)
- Core Hadoop has two main systems:
● Hadoop Distributed File System (HDFS): self-healing,
high-bandwidth clustered storage
● MapReduce: distributed, fault-tolerant resource management
and scheduling, coupled with a scalable data programming
abstraction
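The map/shuffle/reduce flow behind that programming abstraction can be sketched in a few lines of single-process Python. This is purely illustrative (the function names are invented for the sketch): real Hadoop distributes these phases across a cluster via its Java API or Hadoop Streaming.

```python
# In-process sketch of the MapReduce abstraction: word count.
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input record.
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key (the framework does this between phases).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

Because each map and reduce call is independent, the framework can run them on many machines in parallel, which is what makes the abstraction scale.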
5. Etymology
● Hadoop was created in 2004 by Douglass (Doug) Cutting
● It implemented ideas from Google's File System (GFS) and
MapReduce papers
● He aimed to index the web, Google-style, for the Nutch
search engine project
● He named it after his son's favourite toy, a stuffed
elephant called Hadoop
6. What is Big Data?
"In information technology, big data is a loosely defined term
used to describe data sets so large and complex that they become
awkward to work with using on-hand database management tools."
(Wikipedia)
12. How big is big?
● 2008: Google processes 20 PB a day
● 2009: eBay has 6.5 PB of user data, growing by 50 TB a day
● 2011: Yahoo! has 180-200 PB of data
● 2012: Facebook ingests 500 TB of data a day
13. Limitations of Existing Analytics Architecture
A typical stack, bottom to top:
- Instrumentation (raw data sources)
- Data collection (mostly append)
- Storage grid (where archiving = premature death of the data)
- ETL (Extract, Transform & Load)
- RDBMS (aggregated data only)
- BI reports + online apps
Two problems: the original raw data can't be explored, and moving
data from storage to compute doesn't scale.
14. Why Hadoop?
Challenge: read 1 TB of data
- 1 machine, 4 I/O channels, 100 MB/s per channel: ~45 minutes
- 10 machines, 4 I/O channels, 100 MB/s per channel: ~4.5 minutes
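The slide's numbers are a straightforward throughput calculation, sketched below (assuming decimal units, 1 TB = 1,000,000 MB; the slide rounds the results up slightly):

```python
# Back-of-the-envelope check of the "read 1 TB" comparison.
TB_IN_MB = 1_000_000
CHANNELS = 4          # I/O channels per machine
MB_PER_SECOND = 100   # throughput per channel

def minutes_to_read(machines):
    # Aggregate throughput grows linearly with the number of machines.
    throughput = machines * CHANNELS * MB_PER_SECOND  # MB/s
    return TB_IN_MB / throughput / 60

print(round(minutes_to_read(1), 1))   # ~41.7 minutes (slide: ~45)
print(round(minutes_to_read(10), 1))  # ~4.2 minutes (slide: ~4.5)
```

The point is that reading scales linearly with the number of machines, which is exactly what Hadoop exploits by spreading data across a cluster.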
16. The Key Benefit: Agility/Flexibility
Schema-On-Write (RDBMS):
- Schema must be created before any data can be loaded
- An explicit load operation transforms the data into the
database's internal structure
- New columns must be added explicitly before new data for such
columns can be loaded into the database
- Reads are fast
- Standards / governance
Schema-On-Read (Hadoop):
- Data is simply copied to the file store; no transformations
are needed
- A SerDe (Serializer/Deserializer) is applied at read time to
extract the required columns (late binding)
- New data can start flowing at any time and will appear
retroactively once the SerDe is updated to parse it
- Loads are fast
- Flexibility / agility
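The schema-on-read idea can be sketched in plain Python: raw records are stored untouched, and a parse function (standing in here for a Hive SerDe; the function names are invented for the sketch) imposes the schema only when the data is read.

```python
# Raw lines are "loaded" as-is -- no ETL, no declared schema.
raw_store = ["2012-01-01,click,42", "2012-01-02,view,7"]

def serde_v1(line):
    # Original "SerDe": only two columns are extracted at read time.
    date, event, _ = line.split(",")
    return {"date": date, "event": event}

def serde_v2(line):
    # Updated "SerDe": a third column now appears retroactively,
    # even though the stored bytes never changed.
    date, event, count = line.split(",")
    return {"date": date, "event": event, "count": int(count)}

print([serde_v1(r) for r in raw_store])
print([serde_v2(r) for r in raw_store])
```

Swapping `serde_v1` for `serde_v2` changes the visible schema of already-stored data, which is the "late binding" the slide refers to.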
19. Underlying FS Options
ext3:
- Released in 2001
- Used by Yahoo!
- Bootstrapping + formatting are slow
- Settings: noatime; tune2fs (to turn off reserved blocks)
ext4:
- Released in 2008
- Used by Google
- As fast as XFS
- Settings: delayed allocation off; noatime; tune2fs (to turn
off reserved blocks)
XFS:
- Released in 1993
- Fast
- Drawback: deleting a large number of files is slow
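The noatime and reserved-blocks settings above translate into commands roughly like the following (a sketch only: the device name /dev/sdb1 and mount point /data are placeholders, so adapt them to your hardware before running):

```shell
# Mount the data disk without access-time updates; a matching
# /etc/fstab entry would look like:
#   /dev/sdb1  /data  ext4  defaults,noatime  0 0
mount -o noatime /dev/sdb1 /data

# Turn off the root-reserved blocks (the default 5% reservation
# is wasted space on a disk that holds only HDFS data).
tune2fs -m 0 /dev/sdb1
```

Both tweaks reduce overhead on disks that see heavy, sequential Hadoop workloads: noatime avoids a metadata write on every read, and dropping the reservation reclaims capacity.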