The document provides an overview of the Hadoop ecosystem, including introductory information on Hadoop and MapReduce, installing and using Hadoop, programming with Pig and Hive, using NoSQL databases like MongoDB, machine learning with Mahout, and moving data in and out of Hadoop systems. It also covers managing Hadoop clusters, running Hadoop on AWS, data structures and algorithms for Hadoop, and testing and debugging Hadoop applications.
Introduction to Hadoop Distributed Programming
1.
2. Introduction to Distributed Programming
› Background of Hadoop
› What is Hadoop?
› How does Hadoop work?
Installing Hadoop
› Setting up SSH
› Setting up Environment Variables
› Running Hadoop
› Web-Based Cluster
3. Components of Hadoop
› Working with Hadoop File-System
› Understanding Hadoop Map-Reduce
› Reading and Writing
Writing Basic Map Reduce Program
› Getting the Patent Data Set
› Constructing Basic Map-Reduce Program
› Working with Hadoop Streaming
› Improving Performance with Combiners
4. Advanced MapReduce
› Summarization Patterns
› Filtering Patterns
› Data Organization Patterns
› Join Patterns
› Meta Patterns
› Input and Output Patterns
Programming Practices
› Developing Map-Reduce Programs
› Monitoring and Debugging on a cluster
› Tuning for performance
5. Hadoop Cookbook
› Passing Job-Specific Parameters to your tasks
› Probing for Task-Specific Parameters
› Partitioning into multiple output files
› Reading input from and writing output to a database
› Keeping Output in Sorted Order
Managing Hadoop
› Checking System’s Health
› Setting permissions
› Managing Quotas, Enabling Trash, Adding/Deleting Nodes, Recovering from a failed NameNode
6. Running Hadoop in the Cloud
› Introducing Amazon Web Services
› Setting up AWS and Setting up cloud on EC2
› Running Map-Reduce Programs on EC2
› Cleaning up and Shutting down your EC2 instances
› Amazon Elastic Map-Reduce and other AWS Services
7. Programming with Pig
› Thinking like a Pig
› Installing Pig
› Running Pig
› Learning Pig Latin through Grunt
› Pig Latin Syntax
› Working with UDF
› Working with Scripts
8. Getting Started on Hive
› Data Types and File Formats
› HiveQL – Data Definition
› HiveQL – Data Manipulation
› HiveQL – Queries, Views and Indexes
› Schema Design, Tuning & Record Formats
› Hive Integration with Oozie
› Hive and Amazon Web Services
9. NoSQL Database
› Why NoSQL?
› Aggregate Data Models
› Distribution Models
› Consistency
NoSQL Databases
› Key-Value Databases
› Document Databases
› Column Family Stores
› Graph Databases
10. MongoDB
› Introduction
› MongoDB through JavaScript Shell
› Writing Programs using MongoDB
› Document Oriented Data
› Queries and Aggregation
› Updates, Atomic Operations and Deletes
› Indexing, Replication and Sharding
11. Mahout – Machine Learning
› Introduction
› Recommenders
Representing Recommender Data
Making Recommendations
› Clustering
Clustering Algorithms in Mahout
› Classification
Training a Classifier
Evaluating and Tuning a Classifier
12. Moving Data in and out of Hadoop
› Flume
› Oozie
› Sqoop
› HBase
Data Serialization Formats
› XML, JSON
› SequenceFiles, Protocol Buffers, Thrift and Avro
13. Utilizing Data Structures and Algorithms
› Modelling Data & Solving Problems with Graphs
› Parallelized Bloom Filter Creation in Map-Reduce
Programming Pipelines with Pig
› Using Pig to find malicious actors in log data
› Optimizing user workflow with Pig