7. Distributed Storage
● HDFS (master/slaves)
● S3 (bucket, key → blob, no master)
● GridFS ?
● NFS (not for high writes or big data)
8. Hadoop Distributed File System
Advantages
● Simple (more or less)
● Works with everyday hardware (cheap to scale)
● Proven scalability to petabytes
● Lends itself to efficient distributed batch processing
Disadvantages
● Single Point of Failure (HA is a work in progress)
● All meta-data must fit in master's RAM
● No Random Read/Writes
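The RAM constraint above is easy to quantify. A back-of-envelope sketch, assuming the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file or block) and the classic 64 MB default block size; treat the constants as illustrative:

```python
# Why all HDFS meta-data must fit in the master's (NameNode's) RAM:
# estimate heap needed for the namespace. Constants are rules of thumb.

BYTES_PER_OBJECT = 150          # approx. heap cost per file/block entry
BLOCK_SIZE = 64 * 1024 * 1024   # classic HDFS default block size (64 MB)

def namenode_ram_bytes(num_files, avg_file_size):
    """Estimate NameNode heap needed to hold the namespace."""
    blocks_per_file = max(1, -(-avg_file_size // BLOCK_SIZE))  # ceil division
    objects = num_files * (1 + blocks_per_file)  # one file entry + its blocks
    return objects * BYTES_PER_OBJECT

# 100 million 128 MB files already need tens of GB of heap for meta-data alone.
print(namenode_ram_bytes(100_000_000, 128 * 1024 * 1024) / (1024 ** 3))
```

Many small files blow up the object count, which is why HDFS prefers fewer, larger files.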
9. S3
Advantages
● Cheap
● Distributed
● Good for data archival
Disadvantages
● Data is stored externally
● Does not lend itself to batch processing of large volumes of data
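The "bucket, key → blob" model from the earlier slide is worth spelling out: a bucket is a flat namespace with no real directories, and "folders" are just key prefixes you filter on when listing. A minimal in-memory stand-in (not the AWS API):

```python
# Sketch of the S3 data model: a bucket is a flat map of key -> blob.
# "Directories" are an illusion created by prefix-filtered LIST calls.

class Bucket:
    def __init__(self):
        self._objects = {}          # key (str) -> blob (bytes)

    def put(self, key, blob):
        self._objects[key] = blob   # whole-object replace, last write wins

    def get(self, key):
        return self._objects[key]

    def list(self, prefix=""):
        # LIST is a prefix scan over the flat key space.
        return sorted(k for k in self._objects if k.startswith(prefix))

b = Bucket()
b.put("logs/2013/01/app.log", b"...")
b.put("logs/2013/02/app.log", b"...")
b.put("images/cat.png", b"...")
print(b.list("logs/"))  # ['logs/2013/01/app.log', 'logs/2013/02/app.log']
```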
11. Hadoop MapReduce
● Used for distributed serial batch processing
● Works with HDFS
● Simple concept but complex APIs
● Lots of higher level APIs for querying (Pig/Hive)
● Not for random indexed reads
● Not for small data, i.e. < 10 GB
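The simple concept behind the complex APIs is just map → shuffle → reduce over key/value pairs. A single-process word-count sketch of what Hadoop runs across a cluster (function names are ours, not Hadoop's):

```python
# The MapReduce model in plain Python: map -> shuffle -> reduce.
from collections import defaultdict

def map_phase(line):
    for word in line.split():
        yield word, 1                     # emit (key, value) pairs

def shuffle(pairs):
    groups = defaultdict(list)            # group values by key, standing in
    for key, value in pairs:              # for the framework's sort/shuffle
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)               # aggregate per key

lines = ["the quick brown fox", "the lazy dog"]
pairs = (kv for line in lines for kv in map_phase(line))
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts["the"])  # 2
```

Higher-level languages like Pig and Hive compile queries down to chains of exactly these phases.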
12. GridGain
● Fast In-Memory queries
● Not attached to any specific data storage
● API is Java/script based
13. Storm
● In-Memory
● Distributed
● Stream based aggregation/processing
● Supports sending partially aggregated data to backends like HBase/Cassandra
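Partial aggregation in a stream means keeping running sums in memory and periodically flushing increments to the backend. A sketch in that spirit (not the Storm API; a dict stands in for HBase/Cassandra, and all names are ours):

```python
# Stream-based partial aggregation: count events in memory, flush partial
# sums downstream every few events. Since only increments are sent, the
# backend can merge flushes from many workers.
from collections import Counter

class PartialAggregator:
    def __init__(self, backend, flush_every=3):
        self.backend = backend        # key -> running total
        self.pending = Counter()      # in-memory partial counts
        self.flush_every = flush_every
        self.seen = 0

    def on_event(self, key):
        self.pending[key] += 1
        self.seen += 1
        if self.seen % self.flush_every == 0:
            self.flush()

    def flush(self):
        for key, n in self.pending.items():
            self.backend[key] = self.backend.get(key, 0) + n
        self.pending.clear()

backend = {}
agg = PartialAggregator(backend)
for evt in ["click", "click", "view", "click", "view", "view"]:
    agg.on_event(evt)
print(backend)  # {'click': 3, 'view': 3}
```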
14. Akka Actors
● Concurrent processing constructs based on the Erlang actor model
● Latest versions support distributed RPC communication via Netty or ZeroMQ.
● Used for building fast, distributed processing systems.
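The core idea of the actor model: each actor owns private state and a mailbox, and processes one message at a time, so no locks are needed around its state. A single-threaded Python stand-in (not Akka's API, though `tell` borrows its name):

```python
# Minimal actor sketch: private state + mailbox + one-message-at-a-time.
from queue import Queue

class CounterActor:
    def __init__(self):
        self.count = 0             # private state, touched only by this actor
        self.mailbox = Queue()

    def tell(self, message):
        self.mailbox.put(message)  # fire-and-forget send

    def run(self):
        while not self.mailbox.empty():
            message = self.mailbox.get()   # messages handled sequentially
            if message == "increment":
                self.count += 1

actor = CounterActor()
for _ in range(5):
    actor.tell("increment")
actor.run()
print(actor.count)  # 5
```

In Akka the runtime schedules many such actors over a thread pool, and with remoting the same `tell` can cross machine boundaries.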
15. M/R High-level Languages
SQL
• Hive
Imperative
• Pig
Lisp
• Cascalog
R
• Hive JDBC Connection
16. Apache Pig
Advantages
● Simple and programmable
● UDFs and Loader/Store APIs are simple
● Spills to disk to avoid OOM
Disadvantages
● Low level
● Schema-less
17. Hive
Advantages
● SQL interface
● Server mode
● Fast
Disadvantages
● Complex UDF and Load/Store (SerDe) APIs
● Does not spill to disk like Pig to avoid OOM
19. Glue
● Workflows for devops.
● No XML.
● Polyglot language approach: supports Groovy, Scala, Ruby (JRuby), Python (Jython), Clojure, JavaScript
● Data-driven and cron-based workflows
● Separates configuration from workflows
20. Oozie
● XML
● UI for building workflows using blocks (you still have to program the components)
● Buy another pair of glasses
21. Azkaban
● Based on Flows
– These consist of binaries described by a job text file
● Concentrates on generic scheduling and retries in the traditional sense
● Flow UI
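To make "binaries described by a job text file" concrete: Azkaban jobs are plain `key=value` property files, wired into a flow through `dependencies`. A minimal two-job flow might look like this (file and script names are illustrative):

```
# extract.job
type=command
command=bash ./extract.sh

# load.job -- runs only after extract succeeds
type=command
command=bash ./load.sh
dependencies=extract
```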
22. Bash
● Don't do workflows in bash
● Know your bash for simple ad-hoc searches and processing
● Again do not do workflows in bash
24. Hbase and Accumulo
● Both are based on the BigTable paper from Google.
● Column-based storage
● Integrates with HDFS
● Tables act as distributed indexes
● Region Servers are single points of failure
● Aimed at faster reads than writes
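"Tables act as distributed indexes" because rows are kept sorted by key and split into key ranges ("regions" in HBase terms), each served by exactly one region server; that exclusive ownership is also why a region server is a single point of failure for its range. An in-memory sketch of the idea (not the HBase API; all names are ours):

```python
# BigTable-style range partitioning: sorted row keys, split into regions.
import bisect

class RegionedTable:
    def __init__(self, split_keys):
        # split_keys cut the sorted key space into len(split_keys)+1 regions
        self.split_keys = sorted(split_keys)
        self.regions = [dict() for _ in range(len(split_keys) + 1)]

    def _region_for(self, row_key):
        return bisect.bisect_right(self.split_keys, row_key)

    def put(self, row_key, column, value):
        self.regions[self._region_for(row_key)][(row_key, column)] = value

    def get(self, row_key, column):
        # a point read touches exactly one region, like an index lookup
        return self.regions[self._region_for(row_key)][(row_key, column)]

table = RegionedTable(split_keys=["m"])   # two regions: keys <= "m", > "m"
table.put("apple", "cf:color", "red")
table.put("zebra", "cf:legs", "4")
print(table.get("apple", "cf:color"))  # red
```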
25. Cassandra
● Based on the Dynamo paper from Amazon
● No single point of failure
● Aimed at faster writes than reads
● Default eventual consistency with configurable durability options (at the cost of write speed)
● Column Counters
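The configurable-consistency trade-off follows from quorum arithmetic: with replication factor N, a read of R replicas and a write acknowledged by W replicas must overlap on at least one up-to-date replica whenever R + W > N. A pure-arithmetic sketch:

```python
# Dynamo-style tunable consistency: read/write quorums intersect iff R + W > N.

def is_strongly_consistent(n, r, w):
    """True if every read quorum must intersect every write quorum."""
    return r + w > n

# Classic settings for N = 3 replicas:
print(is_strongly_consistent(3, 2, 2))  # True  (QUORUM reads + QUORUM writes)
print(is_strongly_consistent(3, 1, 1))  # False (ONE/ONE: eventual consistency)
```

Raising W buys consistency and durability but slows writes, which is the cost the slide notes.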
26. Others
● Lucene (API for building fast indexes)
● Solr and Elasticsearch
– Built on top of lucene
– Distributed indexes
– Fast query times
● MongoDB (document DB)
● Redis (fast in-memory db)
– Lots of basic constructs, easy to build bloom filters
– Great for real-time processing
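On "easy to build bloom filters": Redis bitmaps (the real `SETBIT`/`GETBIT` commands) give you a shared bit array, so a client-side bloom filter is just a few hashes per key. A local `bytearray` stands in for the Redis bitmap in this sketch; class and key names are ours:

```python
# Bloom filter over a bit array, the construct the Redis bullet alludes to.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)   # stand-in for a Redis bitmap

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):       # Redis: SETBIT key pos 1
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))   # Redis: GETBIT key pos

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True
print(bf.might_contain("user:99"))   # very likely False
```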