3. WHAT IS BIG DATA?
Volume – Terabytes, petabytes, exabytes
4. WHAT IS BIG DATA?
Volume
Velocity – System logs, medical monitors, machinery controls
5. WHAT IS BIG DATA?
Volume
Velocity
Variety – RDBMS, social media, XML, JSON documents, IoT
Veracity
Variability
Value
6. How do we collect, store and process all this data?
7. • Open Source Apache Software.
• Distributed processing across clusters of computers.
• Designed to scale to thousands of computers.
• Local computation and storage.
• Expects hardware failure which is handled at the application layer.
A cute yellow elephant
8. HADOOP ECOSYSTEM OVERVIEW
• Distributed storage and processing.
• Runs on commodity server hardware.
• Scales horizontally for seamless failover.
• Hadoop is open source software.
9. TRADITIONAL DATA REPOSITORIES
• Very structured, in 3NF or star schemas.
• Serves as the enterprise “Single Source of Truth.”
• Optimized for operational reporting requirements.
• Scales vertically.
• Limited interaction with external or unstructured data sources.
• Complex management schemes and protocols.
10. TRADITIONAL DATA SOURCES IN HEALTHCARE
• Data for the healthcare EDW originates from the organization’s clinical and administrative functions.
• Sources can be as sophisticated as highly complex online systems or as simple as Excel spreadsheets.
• Data goes through complex validation and transformation processes before inclusion in the EDW.
• Staging the data transformations requires separate storage and processing space, but is often done on the same physical hardware as the EDW.
11. INTEGRATION OF HADOOP AND TRADITIONAL IT
• Hadoop does not replace traditional storage or processing technologies.
• Hadoop can include data from traditional IT sources to discover new value.
• Compared to traditional IT, setting up and operating a Hadoop platform can be very inexpensive.
• It can be seen as very expensive when added to existing traditional IT environments.
12. EMERGING AND NON-TRADITIONAL DATA
• New knowledge is discovered by applying known experience in context with unknown or new experience.
• New sources of data are being created in a seemingly unending manner.
• Social media and mobile computing provide sources of new data unavailable in the past.
• Monitors, system logs, and document corpora all provide new ways of capturing and expressing the human experience that cannot be captured or analyzed by traditional IT methodologies.
13. INTEGRATION OF HADOOP AND NON-TRADITIONAL DATA
• Hadoop is designed to store and process non-traditional data sets.
• Optimized for unstructured, file-based data sources.
• Core applications are developed specifically for different storage, processing, analysis and display activities.
• Metadata definitions and rules, combined with data from disparate data sources, can be used for deeper analytic discovery.
14. DATA ANALYSIS
• Inspecting, transforming and modeling data to discover knowledge, make predictions and suggest conclusions.
• Third-party data analysis tools can be integrated into traditional IT environments or big data solutions.
• Analysis has traditionally been conducted by working on discrete data sets in isolation from the decision-making process.
• With big data platforms, data scientists are integrated into core business processes to create solutions for critical business problems.
15. COMPLETE HADOOP ECOSYSTEM
• Integration between traditional and non-traditional data is facilitated by the Hadoop ecosystem.
• Data is stored on a fault-tolerant distributed file system in the Hadoop cluster.
• Data is processed close to where it is stored, reducing latency and time-consuming transfer processes (a minimal locality lookup is sketched after this slide).
• The Hadoop master controller, or “NameNode”, monitors the processes of the Hadoop cluster and automatically executes actions to continue processing when failure is detected.
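As a minimal illustration of data locality, the sketch below asks the NameNode which DataNodes hold each block of a file, using the standard HDFS FileSystem API; the path /data/healthy_people_2010.csv is an illustrative placeholder and the cluster settings are assumed to be available on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
    public static void main(String[] args) throws Exception {
        // Connects to the cluster described by core-site.xml / hdfs-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path; replace with a real HDFS path.
        Path file = new Path("/data/healthy_people_2010.csv");
        FileStatus status = fs.getFileStatus(file);

        // The NameNode returns, for each block, the DataNodes holding a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}

MapReduce schedulers use exactly this block-location information to run tasks on, or near, the nodes that already hold the data.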
17. CORE COMPONENT - STORAGE
• HDFS – A distributed file system designed to run on commodity-grade hardware in the Hadoop computing ecosystem. This file system is highly fault tolerant, provides very high throughput to data, and is suitable for very large data sets. Fault tolerance is enabled by making redundant copies of data blocks and distributing them throughout the Hadoop cluster. (A minimal Java sketch of the file system API follows the list of characteristics below.)
• Key Characteristics Include:
• Streaming data access – Designed for batch processing instead of interactive use.
• Large data sets – Typically in gigabytes to terabytes in size.
• Single Coherency Model - To enable high throughput access.
• Moving computation is cheaper than moving data.
• Designed to be easily portable.
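A minimal sketch of the streaming access pattern, assuming a reachable cluster whose settings are on the classpath; the path /user/demo/sample.txt is an illustrative placeholder.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up cluster settings from the classpath
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/sample.txt"); // hypothetical path

        // Write: blocks are replicated across DataNodes automatically.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: a streaming, sequential access pattern suits HDFS best.
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}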
• Hive – A data warehouse implementation in Hadoop that facilitates the querying and management of large datasets kept in distributed storage. (A small JDBC sketch follows the feature list below.)
• Key Features:
• Tools for ETL
• A methodology for providing structure for multiple data formats.
• Access to files stored in HDFS or HBase
• Executes queries via MapReduce.
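A minimal sketch of querying Hive from Java through the HiveServer2 JDBC driver; the host, port, credentials, and the healthy_people_2010 table with its county column are illustrative assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, port, database and table are assumptions for this sketch.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {

            // The HiveQL below is compiled into MapReduce work by Hive itself.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT county, COUNT(*) FROM healthy_people_2010 GROUP BY county")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
}

The client only receives the final result rows; all heavy processing happens on the cluster.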
18. CORE COMPONENT – STORAGE …
• HBase – A distributed, scalable big data database that provides random, realtime read/write access to big data. (A small client sketch follows the feature list.)
• Key Features:
• Modular scalability.
• Strictly consistent reads and writes.
• Automatic sharding of tables (partitioning tables to smaller more manageable parts).
• Automatic failover.
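A minimal sketch of random realtime read/write access using the HBase Java client (1.x-style API); the observations table, vitals column family and row key are illustrative assumptions, and the table would need to exist beforehand, for example created in the HBase shell.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("observations"))) { // hypothetical table

            // Random realtime write keyed by row.
            Put put = new Put(Bytes.toBytes("patient-0001"));
            put.addColumn(Bytes.toBytes("vitals"), Bytes.toBytes("heart_rate"), Bytes.toBytes("72"));
            table.put(put);

            // Random realtime read of the same row.
            Result row = table.get(new Get(Bytes.toBytes("patient-0001")));
            System.out.println(Bytes.toString(
                    row.getValue(Bytes.toBytes("vitals"), Bytes.toBytes("heart_rate"))));
        }
    }
}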
19. CORE COMPONENT - MANAGEMENT
• ZooKeeper – A centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services. (A small client sketch follows this list.)
• Avro – A data serialization system.
• Oozie – A Hadoop workflow scheduler.
• Whirr – A cloud-neutral library for running cloud services.
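A minimal ZooKeeper client sketch, assuming an ensemble at localhost:2181; the /demo-config znode and its stored value are illustrative placeholders for a piece of shared configuration.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        // Block until the session with the ZooKeeper ensemble is established.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a small piece of shared configuration under a znode (path and value are illustrative).
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "batch.size=128".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any node in the cluster can read the same value back.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));
        zk.close();
    }
}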
CORE COMPONENT - PROCESSING
• MapReduce – An implementation for processing and generating large data sets with a parallel, distributed algorithm on a Hadoop cluster. (The classic word-count example is sketched after the feature list below.)
• Key Features:
• Automatic parallelization and distribution
• Fault-tolerance
• I/O Scheduling
• Status Monitoring
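The standard word-count program, essentially as it appears in the Hadoop MapReduce tutorial, shows the division of labor: the code supplies only the map and reduce functions, while parallelization, distribution, fault tolerance and status reporting are handled by the framework. Input and output HDFS paths are passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs in parallel on the nodes holding each input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: receives all counts for one word, grouped by the framework.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // Driver: parallelization, distribution, retries on failure and status
        // reporting are handled by the framework, not by this code.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}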
20. CORE COMPONENT - INTEGRATION
• Sqoop – A utility designed to efficiently transfer bulk data between Hadoop and relational databases.
• Flume – A service, based on streaming data flows, for collecting, aggregating and moving large amounts of
system log data.
CORE COMPONENT – PROGRAMMING
• Pig – A high-level language for analyzing very large data sets, designed to efficiently utilize parallel processing to achieve its results.
• Key Properties:
• Ease of programming – Complex tasks are explicitly encoded as data flow sequences, making them easy to understand and implement.
• Significant optimization opportunities – the system optimizes execution automatically.
• Extensibility – Users can encode their own functions.
• HiveQL – A SQL-like query language for data stored in Hive tables; queries are converted into MapReduce jobs.
• Jaql – A data processing and query language used to process JSON data on Hadoop.
21. CORE COMPONENT - INSIGHT
• Mahout – A library of callable machine learning algorithms which uses the MapReduce paradigm. (A small recommender sketch follows the use cases below.)
• Supports four main data use cases:
• Collaborative filtering – analyzes behavior and makes recommendations.
• Clustering – organizes data into naturally occurring groups.
• Classification – learns from known characteristics of existing categorizations and makes
assignments of unclassified items into a category.
• Frequent item or market basket mining – analyzes data items in transactions and identifies items
which typically occur together.
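A minimal collaborative-filtering sketch using Mahout’s Taste recommender classes; ratings.csv (lines of userID,itemID,preference) is an illustrative placeholder. This particular API runs in a single JVM, while Mahout also ships MapReduce-based implementations of these algorithm families for cluster-scale data.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class SimpleRecommender {
    public static void main(String[] args) throws Exception {
        // ratings.csv is a hypothetical file of "userID,itemID,preference" lines.
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Find users whose behavior looks similar, then recommend what they liked.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 1.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " scored " + item.getValue());
        }
    }
}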
• Hue – A set of web applications that enables a user to interact with a Hadoop cluster, and lets the user browse and interact with Hive, Impala, MapReduce jobs and Oozie workflows.
• Beeswax – An application which allows the user to perform queries on the Hive data warehousing
application. You can create Hive tables, load data, run queries and download results in Excel spreadsheet
format or CSV format.
22. HADOOP DISTRIBUTIONS
Amazon Web Services Elastic MapReduce
•One of the first Hadoop commercial offerings
•Has the largest commercial Hadoop market share
•Includes strong integration with other AWS cloud products
•Auto scaling and support for NoSQL and BI integration
Cloudera
•2nd largest commercial market share
•Experience with very large deployments
•Revenue model based on software subscriptions
•Aggressive innovation to meet customer demands
Hortonworks
•Strong engineering partnerships with flagship companies.
•Innovation driven through the open source community.
•Is a key contributor to the Hadoop core project.
•Commits corporate resources to jump start Hadoop community projects.
23. HADOOP DISTRIBUTIONS …
International Business Machines
•Vast experience in distributed computing and data management.
•Experience with very large deployments.
•Has advanced analytic tools, and global recognition.
•Integration with vast array of IBM management and productivity software.
MapR Technologies
•Heavy focus and early adopter of enterprise features.
•Supports some legacy file systems such as NFS.
•Adding performance enhancements for HBase, high-availability and disaster recovery.
Pivotal
•Spin-off from EMC and VMware.
•Strong cadre of technical consultants and data scientists.
•Focus on MPP SQL engine and EDW with very high performance.
•Has an appliance with integrated Hadoop, EDW and data management in a single rack.
24. HADOOP DISTRIBUTIONS …
Teradata
• Specialist and strong background in EDW.
• Has a strong technical partnership with Hortonworks.
• Has very strong integration between Hadoop and Teradata’s management and EDW
tools.
• Extensive financial and technical resources allow creation of unique and powerful
appliances.
Microsoft Windows Azure HDInsight
• A product designed specifically for the cloud in partnership with Hortonworks.
• The only Hadoop distribution that runs in the Windows environment.
• Allows SQL Server users to also execute queries that include data stored in Hadoop.
• Unique marketing advantage for offering the Hadoop stack to traditional Windows
customers.
25. RECOMMENDATION
• Commitment and leadership in the open source community
• Strong engineering partnerships
• Innovation driven from the community
• Innovative
• Secure
• Big data / health research collaboration
26. CLUSTER DIAGRAM
• NameNode is a single master server which manages the file system and file system operations.
• DataNodes are slave servers that manage the data and the storage attached to the nodes they run on.
• The NameNode is a single point of failure for the HDFS cluster.
• A SecondaryNameNode can be configured on a separate server in the cluster, which creates checkpoints for the namespace.
• The SecondaryNameNode is not a failover NameNode.
28. HADOOP SANDBOX IN ORACLE VIRTUALBOX
Host Specification
• Windows 10
• Intel® Core™ i7-4770 CPU @ 3.40GHz
• 16GB Installed RAM
• 64-bit OS, x64
• 1.65 TB Storage
VM Specification
• Cloudera Quickstart Sandbox
• Red Hat
• Intel® Core™ i7-4770 CPU @ 3.40GHz
• 10GB Allocated RAM
• 32MB Video Memory
• 64-bit OS
• 64GB Storage
• Shared Clipboard: Bidirectional
• Drag’n’Drop: Bidirectional
29. CLOUDERA HADOOP DESKTOP & INTERFACE
Opening the Cloudera interface and a view of the CDC “Healthy People 2010” data set that was uploaded to the Red Hat OS.
30. HUE FILE BROWSER
• Folder List
• File Contents
• Displayed file content is from the Vulnerable Population and Environmental Health data of the “Healthy People 2010” data set.
31. ADDING DATA TO HIVE
32. ADDING DATA TO HIVE …
Choosing a delimiter type
Defining columns
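Behind the wizard, choosing a delimiter and defining columns amounts to a HiveQL CREATE TABLE statement followed by a data load; a minimal JDBC equivalent is sketched below, in which the vulnerable_population table, its columns, the HDFS path and the connection details are illustrative assumptions rather than the actual Healthy People 2010 schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateHiveTable {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "cloudera", "");
             Statement stmt = con.createStatement()) {

            // "Defining columns" and "choosing a delimiter" from the wizard,
            // expressed directly in HiveQL (schema is illustrative).
            stmt.execute("CREATE TABLE IF NOT EXISTS vulnerable_population ("
                    + " state STRING, county STRING, measure_name STRING, rate DOUBLE)"
                    + " ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
                    + " STORED AS TEXTFILE");

            // Load the uploaded CSV from HDFS into the new table (path is illustrative).
            stmt.execute("LOAD DATA INPATH '/user/cloudera/healthy_people_2010.csv'"
                    + " INTO TABLE vulnerable_population");
        }
    }
}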
33. ADDING DATA TO HIVE …
• Hive Table List
• Table properties