This document provides an overview of big data and explains how to get started with Apache Hadoop software. Big data is characterized by the volume, variety, and velocity of data. Apache Hadoop is an open-source software framework that handles large data sets across clusters of computers. The document outlines two approaches to implementing Apache Hadoop and recommends steps to optimize performance, such as using Intel technologies and tuning hardware and software settings.
Getting Started with Big Data: Planning Guide
1. INTEL CONFIDENTIAL, FOR INTERNAL USE ONLY
Getting Started with Big Data
How to Move Forward with Apache Hadoop* Software
2. INTEL CONFIDENTIAL
Five Things to Know
1. Big data is a disruptive force that can drive competitive advantage
2. Apache Hadoop* software is an emerging technology for big data analytics
3. There are two approaches to implementing big data projects
4. Intel® technologies and software support big data
5. Optimize and tune your big data environment for best performance
3. INTEL CONFIDENTIAL
Big Data
Volume, Variety, and Velocity
Volume: Data sets that are orders of magnitude larger
than you have handled before
• The digital universe of data could reach 8 zettabytes by 2015.¹
• That equals the data held by 18 million U.S. Libraries of Congress.²
Variety: More diverse data types, including:
• Structured (transactions, customer information)
• Semistructured and unstructured (web logs, e-mails, documents,
images, video)
Velocity: Arriving faster than ever before
• Real-time streaming data
1 Gens, Frank. IDC Predictions 2012: Competing for 2020. IDC (December 2011).
2 “Big Data Infographic and Gartner 2012 Top 10 Strategic TechTrends.” Business Analytics 3.0 (blog) (November 11, 2011).
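The two volume figures above can be cross-checked with simple arithmetic. The sketch below is only a back-of-the-envelope calculation: it assumes decimal (SI) units, which the slide does not specify, and derives the size of one Library of Congress that the comparison implies.

```python
# Back-of-the-envelope check of the volume comparison on this slide.
# Assumption: decimal (SI) units; the slide does not say which definition it uses.
ZETTABYTE = 10**21  # bytes
TERABYTE = 10**12   # bytes

digital_universe_bytes = 8 * ZETTABYTE  # figure quoted on the slide
libraries_of_congress = 18_000_000      # figure quoted on the slide

# Size of a single Library of Congress implied by the two figures.
bytes_per_library = digital_universe_bytes / libraries_of_congress
terabytes_per_library = bytes_per_library / TERABYTE

print(f"Implied size of one Library of Congress: {terabytes_per_library:.0f} TB")
```

Under these assumptions the comparison implies roughly 444 TB per library, which is a derived number and not one stated in either cited source.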
4. INTEL CONFIDENTIAL
Getting Bigger
Billions of Connected Devices and Internet Users
Source: Savitz, Eric. “Cisco Predicts the Rise of the Zettabyte Era.” Forbes (May 30, 2012).
forbes.com/sites/ericsavitz/2012/05/30/cisco-predicts-the-rise-of-the-zettabyte-era/
By 2016, 19 billion connected devices, including 3.4 billion Internet users and machine-to-machine connections, will contribute to the flood of big data.
5. INTEL CONFIDENTIAL
The Reason for All the Buzz
Big Data Drives Competitive Advantage
The real value of big data is in the insights it produces when analyzed:
Finding patterns
Deriving meaning
Making decisions
Responding to the world with intelligence
6. INTEL CONFIDENTIAL
The Apache Hadoop* Framework
An Emerging Approach to Big Data Analytics
Open-source software that provides a simple programming model for
distributed processing of large data sets
• Provides massively scalable storage and a data processing system (not a
database) built on clusters of computers
• Supplements your existing systems by handling data that’s typically
a problem for them
- Too large
- Unstructured
- Mix of types
- Real-time streaming
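The slide does not name the programming model, but the one most closely associated with Hadoop is MapReduce. The sketch below imitates its three phases (map, shuffle, reduce) in plain Python on a single machine, using a word count. It does not use any Hadoop API; the function names and the sample input are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical sample input; a real job would read blocks of files from the cluster.
sample = ["big data big insight", "data at scale"]
print(word_count(sample))
```

In a real cluster the map and reduce calls would run in parallel on different servers, each over its own portion of the data; the developer writes only the two small functions, which is what keeps the model simple.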
7. INTEL CONFIDENTIAL
Apache Hadoop* Breakthroughs
Advantages over Traditional Systems
It handles all kinds of data.
• No need to develop specific schemas.
It scales quickly and affordably.
• Add more servers and storage as you need them.
It reveals new insight.
• Find hidden relationships that were difficult, or even impossible, to find in the past.
It reduces costs.
• Open-source software that runs on standard servers.
• Lower cost per terabyte for storage and processing.
It delivers higher availability.
• Fault tolerant; designed to recover from hardware, software, and system failures.
It lowers organizational risk.
• Apache Hadoop* innovations continue through an active and diverse global community.
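The point about not needing specific schemas is often described as applying structure when data is read, not when it is stored. The sketch below illustrates that idea in plain Python, with no Hadoop API involved. The record contents and the `read_amounts` helper are hypothetical.

```python
import json

# Hypothetical raw records of mixed shape, stored as-is with no upfront schema.
RAW_RECORDS = [
    '{"type": "transaction", "customer": "c42", "amount": 19.99}',
    '{"type": "weblog", "url": "/pricing", "status": 200}',
    "free-text e-mail body that is not JSON at all",
]


def read_amounts(raw_records):
    """Apply structure only at read time: keep transaction amounts, skip the rest."""
    amounts = []
    for raw in raw_records:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unstructured record; this particular query ignores it
        if isinstance(record, dict) and record.get("type") == "transaction":
            amounts.append(record["amount"])
    return amounts


print(read_amounts(RAW_RECORDS))
```

A different question about the same stored records, such as counting web log status codes, would need only a different read function and no change to how the data is stored.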
8. INTEL CONFIDENTIAL
Two Approaches to Apache Hadoop*
What’s Right for Your Organization?
1. Apache Hadoop* software-only deployments
• Free Apache Hadoop open-source software
• Vendor distributions that prepackage Hadoop* software with value-added enhancements and services
2. Hadoop software integrated with traditional databases
• Extend existing data warehousing and analytics platforms to include Hadoop software
9. INTEL CONFIDENTIAL
Apache Hadoop* Deployment
Put the Right Infrastructure in Place
Clusters of standard servers
10 gigabit Ethernet networking
Intelligent storage
Apache Hadoop* software
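When sizing a cluster built from the components above, a rough capacity estimate is a useful first step. The sketch below is a simplified calculation, not a sizing method from the guide. It assumes three copies of each data block (the default replication factor in Hadoop's distributed file system) and a hypothetical 25% of raw disk space reserved for temporary and intermediate data. The node and disk counts are example values.

```python
def usable_capacity_tb(nodes, disks_per_node, disk_tb, replication=3, reserve=0.25):
    """Estimate usable cluster storage in terabytes.

    replication: copies kept of each block (3 is the HDFS default).
    reserve: assumed fraction of raw space held back for temporary data
             (a hypothetical planning value, not a Hadoop setting).
    """
    raw_tb = nodes * disks_per_node * disk_tb
    return raw_tb * (1 - reserve) / replication


# Example values only: 10 servers, each with 12 disks of 4 TB.
print(usable_capacity_tb(nodes=10, disks_per_node=12, disk_tb=4))
```

With these example values, 480 TB of raw disk yields about 120 TB of usable space, which shows why replication has to be factored into hardware planning early.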
10. INTEL CONFIDENTIAL
Intel® Technologies for Big Data
Get Maximum Performance
Server clusters: Intel® Xeon® processor E5 family
Networking: Intel® Ethernet 10 Gigabit Converged Network Adapters
Storage: Intel® Solid-State Drives
Software: Intel® Distribution for Apache Hadoop* software (Intel Distribution)¹
1 Currently available in China, Taiwan, and the United States.
11. INTEL CONFIDENTIAL
Intel® Distribution for Apache Hadoop* Software
Enterprise ready for a variety of use cases¹
Supports a wide range of analytics
• Enhances Apache Hive* and Apache HBase* software
Introduces graph analytics capabilities with Intel® GraphBuilder software
• Provides a Java library for constructing graphs that help visualize data relationships
Optimizes open-source Apache Hadoop* components
• Takes advantage of Intel Xeon® processor capabilities
Hadoop* security, scalability, and management enhancements
• Tightly integrated into the platform
Support and services from Intel and its partners
Find out more about the Intel Distribution
1 Currently available in China, Taiwan, and the United States.
12. INTEL CONFIDENTIAL
Apache Hadoop* Optimization
Practical Trade-offs for Hardware, Software, and System Settings
Fine-tune your solution for best performance:
Maximize productivity
Limit energy consumption
Maximize resource utilization
Reduce operating costs
Lower your total cost of ownership
13. INTEL CONFIDENTIAL
Benchmark Performance
Intel’s HiBench Suite
Comprehensive set of benchmark tests for Apache Hadoop* software
Represents important Hadoop* workloads and analytics with a mix of hardware usage characteristics
Available as open-source software under Apache License 2.0 at https://github.com/hibench/HiBench-2.1
14. INTEL CONFIDENTIAL
Get Started
Five Steps for IT Managers
1. Work with your business users to articulate the big opportunities
2. Do your research to get up to speed on the technology
3. Develop use case(s) for your project
4. Identify gaps between current- and future-state capabilities
5. Develop a test environment for a production version
15. INTEL CONFIDENTIAL
Big Data Planning Guide
Everything You Need to Get Started
Intel.com/ITCenter
Read the full planning guide at Intel.com/bigdata
Learn more about the Intel® Distribution for Apache Hadoop* software at hadoop.intel.com