7. Distributed Storage
● HDFS (master/slaves)
● S3 (bucket, key → blob, no master)
● GridFS ?
● NFS (not for high writes or big data)
8. Hadoop Distributed File System
Advantages
● Simple (more or less)
● Works with everyday hardware (cheap to scale)
● Proven scalability to petabytes
● Lends itself to efficient distributed batch processing
Disadvantages
● Single Point of Failure (HA is a work in progress)
● All meta-data must fit in master's RAM
● No Random Read/Writes
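The RAM constraint above is easy to quantify. A back-of-envelope sketch, assuming the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file or block) and the classic 64 MB default block size; treat the constants as illustrative:

```python
# Why all HDFS meta-data must fit in the master's (NameNode's) RAM:
# estimate heap needed for the namespace. Constants are rules of thumb.

BYTES_PER_OBJECT = 150          # approx. heap cost per file/block entry
BLOCK_SIZE = 64 * 1024 * 1024   # classic HDFS default block size (64 MB)

def namenode_ram_bytes(num_files, avg_file_size):
    """Estimate NameNode heap needed to hold the namespace."""
    blocks_per_file = max(1, -(-avg_file_size // BLOCK_SIZE))  # ceil division
    objects = num_files * (1 + blocks_per_file)  # one file entry + its blocks
    return objects * BYTES_PER_OBJECT

# 100 million 128 MB files already need tens of GB of heap for meta-data alone.
print(namenode_ram_bytes(100_000_000, 128 * 1024 * 1024) / (1024 ** 3))
```

Many small files blow up the object count, which is why HDFS prefers fewer, larger files.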
9. S3
Advantages
● Cheap
● Distributed
● Good for data archival
Disadvantages
● Data is stored externally
● Does not lend itself to batch processing of large volumes of data
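The "bucket, key → blob" model from the earlier slide is worth spelling out: a bucket is a flat namespace with no real directories, and "folders" are just key prefixes you filter on when listing. A minimal in-memory stand-in (not the AWS API):

```python
# Sketch of the S3 data model: a bucket is a flat map of key -> blob.
# "Directories" are an illusion created by prefix-filtered LIST calls.

class Bucket:
    def __init__(self):
        self._objects = {}          # key (str) -> blob (bytes)

    def put(self, key, blob):
        self._objects[key] = blob   # whole-object replace, last write wins

    def get(self, key):
        return self._objects[key]

    def list(self, prefix=""):
        # LIST is a prefix scan over the flat key space.
        return sorted(k for k in self._objects if k.startswith(prefix))

b = Bucket()
b.put("logs/2013/01/app.log", b"...")
b.put("logs/2013/02/app.log", b"...")
b.put("images/cat.png", b"...")
print(b.list("logs/"))  # ['logs/2013/01/app.log', 'logs/2013/02/app.log']
```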
11. Hadoop MapReduce
● Used for distributed serial batch processing
● Works with HDFS
● Simple concept but complex APIs
● Lots of higher level APIs for querying (Pig/Hive)
● Not for random indexed reads
● Not for small data, i.e. < 10 GB
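The simple concept behind the complex APIs is just map → shuffle → reduce over key/value pairs. A single-process word-count sketch of what Hadoop runs across a cluster (function names are ours, not Hadoop's):

```python
# The MapReduce model in plain Python: map -> shuffle -> reduce.
from collections import defaultdict

def map_phase(line):
    for word in line.split():
        yield word, 1                     # emit (key, value) pairs

def shuffle(pairs):
    groups = defaultdict(list)            # group values by key, standing in
    for key, value in pairs:              # for the framework's sort/shuffle
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)               # aggregate per key

lines = ["the quick brown fox", "the lazy dog"]
pairs = (kv for line in lines for kv in map_phase(line))
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts["the"])  # 2
```

Higher-level languages like Pig and Hive compile queries down to chains of exactly these phases.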
12. GridGain
● Fast In-Memory queries
● Not attached to any specific data storage
● API is Java/script based
13. Storm
● In-Memory
● Distributed
● Stream based aggregation/processing
● Supports sending partially aggregated data to backends like HBase/Cassandra
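Partial aggregation in a stream means keeping running sums in memory and periodically flushing increments to the backend. A sketch in that spirit (not the Storm API; a dict stands in for HBase/Cassandra, and all names are ours):

```python
# Stream-based partial aggregation: count events in memory, flush partial
# sums downstream every few events. Since only increments are sent, the
# backend can merge flushes from many workers.
from collections import Counter

class PartialAggregator:
    def __init__(self, backend, flush_every=3):
        self.backend = backend        # key -> running total
        self.pending = Counter()      # in-memory partial counts
        self.flush_every = flush_every
        self.seen = 0

    def on_event(self, key):
        self.pending[key] += 1
        self.seen += 1
        if self.seen % self.flush_every == 0:
            self.flush()

    def flush(self):
        for key, n in self.pending.items():
            self.backend[key] = self.backend.get(key, 0) + n
        self.pending.clear()

backend = {}
agg = PartialAggregator(backend)
for evt in ["click", "click", "view", "click", "view", "view"]:
    agg.on_event(evt)
print(backend)  # {'click': 3, 'view': 3}
```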
14. Akka Actors
● Concurrent processing constructs based on the Erlang actor model
● Latest versions support distributed RPC communication via Netty or ZeroMQ.
● Used for building fast, distributed processing systems.
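The core idea of the actor model: each actor owns private state and a mailbox, and processes one message at a time, so no locks are needed around its state. A single-threaded Python stand-in (not Akka's API, though `tell` borrows its name):

```python
# Minimal actor sketch: private state + mailbox + one-message-at-a-time.
from queue import Queue

class CounterActor:
    def __init__(self):
        self.count = 0             # private state, touched only by this actor
        self.mailbox = Queue()

    def tell(self, message):
        self.mailbox.put(message)  # fire-and-forget send

    def run(self):
        while not self.mailbox.empty():
            message = self.mailbox.get()   # messages handled sequentially
            if message == "increment":
                self.count += 1

actor = CounterActor()
for _ in range(5):
    actor.tell("increment")
actor.run()
print(actor.count)  # 5
```

In Akka the runtime schedules many such actors over a thread pool, and with remoting the same `tell` can cross machine boundaries.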
15. M/R High-level Languages
SQL
• Hive
Imperative
• Pig
Lisp
• Cascalog
R
• Hive JDBC Connection
16. Apache Pig
Advantages
● Simple and programmable
● UDFs and Loader/Store APIs are simple
● Spills to disk to avoid OOM
Disadvantages
● Low level
● Schema-less
17. Hive
Advantages
● SQL interface
● Server mode
● Fast
Disadvantages
● Complex UDF and Load/Store (SerDe) APIs
● Does not spill to disk like Pig to avoid OOM
19. Glue
● Workflows for devops.
● No XML.
● Polyglot language approach: supports Groovy, Scala, Ruby (JRuby), Python (Jython), Clojure, JavaScript
● Data-driven and cron-based workflows
● Separates configuration from workflows
20. Oozie
● XML
● UI for building workflows using blocks (you still have to program the components)
● Buy another pair of glasses
21. Azkaban
● Based on Flows
– These consist of binaries described by a job text file
● Concentrates on generic scheduling and retries in the traditional sense
● Flow UI
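To make "binaries described by a job text file" concrete: Azkaban jobs are plain `key=value` property files, wired into a flow through `dependencies`. A minimal two-job flow might look like this (file and script names are illustrative):

```
# extract.job
type=command
command=bash ./extract.sh

# load.job -- runs only after extract succeeds
type=command
command=bash ./load.sh
dependencies=extract
```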
22. Bash
● Don't do workflows in bash
● Know your bash for simple ad-hoc searches and processing
● Again do not do workflows in bash
24. Hbase and Accumulo
● Both are based on the BigTable paper from Google.
● Column-based storage
● Integrates with HDFS
● Tables act as distributed indexes
● Region Servers are single points of failure
● Aimed at faster reads than writes
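"Tables act as distributed indexes" because rows are kept sorted by key and split into key ranges ("regions" in HBase terms), each served by exactly one region server; that exclusive ownership is also why a region server is a single point of failure for its range. An in-memory sketch of the idea (not the HBase API; all names are ours):

```python
# BigTable-style range partitioning: sorted row keys, split into regions.
import bisect

class RegionedTable:
    def __init__(self, split_keys):
        # split_keys cut the sorted key space into len(split_keys)+1 regions
        self.split_keys = sorted(split_keys)
        self.regions = [dict() for _ in range(len(split_keys) + 1)]

    def _region_for(self, row_key):
        return bisect.bisect_right(self.split_keys, row_key)

    def put(self, row_key, column, value):
        self.regions[self._region_for(row_key)][(row_key, column)] = value

    def get(self, row_key, column):
        # a point read touches exactly one region, like an index lookup
        return self.regions[self._region_for(row_key)][(row_key, column)]

table = RegionedTable(split_keys=["m"])   # two regions: keys <= "m", > "m"
table.put("apple", "cf:color", "red")
table.put("zebra", "cf:legs", "4")
print(table.get("apple", "cf:color"))  # red
```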
25. Cassandra
● Based on the Dynamo paper from Amazon
● No single point of failure
● Aimed at faster writes than reads
● Default eventual consistency with configurable durability options (at the cost of write speed)
● Column Counters
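The configurable-consistency trade-off follows from quorum arithmetic: with replication factor N, a read of R replicas and a write acknowledged by W replicas must overlap on at least one up-to-date replica whenever R + W > N. A pure-arithmetic sketch:

```python
# Dynamo-style tunable consistency: read/write quorums intersect iff R + W > N.

def is_strongly_consistent(n, r, w):
    """True if every read quorum must intersect every write quorum."""
    return r + w > n

# Classic settings for N = 3 replicas:
print(is_strongly_consistent(3, 2, 2))  # True  (QUORUM reads + QUORUM writes)
print(is_strongly_consistent(3, 1, 1))  # False (ONE/ONE: eventual consistency)
```

Raising W buys consistency and durability but slows writes, which is the cost the slide notes.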
26. Others
● Lucene (API for building fast indexes)
● Solr and Elasticsearch
– Built on top of lucene
– Distributed indexes
– Fast query times
● MongoDB (document DB)
● Redis (fast in-memory db)
– Lots of basic constructs, easy to build bloom filters
– Great for real-time processing
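On "easy to build bloom filters": Redis bitmaps (the real `SETBIT`/`GETBIT` commands) give you a shared bit array, so a client-side bloom filter is just a few hashes per key. A local `bytearray` stands in for the Redis bitmap in this sketch; class and key names are ours:

```python
# Bloom filter over a bit array, the construct the Redis bullet alludes to.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)   # stand-in for a Redis bitmap

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):       # Redis: SETBIT key pos 1
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))   # Redis: GETBIT key pos

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True
print(bf.might_contain("user:99"))   # very likely False
```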