2. Big-data
Four parameters:
• Velocity: Streaming data and large-volume data movement.
• Volume: Scale from terabytes to zettabytes.
• Variety: Manage the complexity of multiple relational and non-relational data types and schemas.
• Voracity: Produced data has to be consumed quickly, before it becomes meaningless.
3. Not just internet companies
Big data shouldn't be a silo: it must be an integrated part of the enterprise information architecture.
4. Data >> Information >> Business Value
Retail: By combining data feeds on inventory, transactional records, social media and online trends, retailers can make real-time decisions around product and inventory mix adjustments, marketing and promotions, pricing or product quality issues.
Financial Services: By combining data across various groups and services, such as financial markets, money management and lending, financial services companies can gain a comprehensive view of their individual customers and markets.
Government: By collecting and analyzing data across agencies, locations and employee groups, the government can reduce redundancies, identify productivity gaps, pinpoint consumer issues and find procurement savings across agencies.
Healthcare: Big data in healthcare could be used to help improve hospital operations, provide better tracking of procedural outcomes, help accelerate research and grow knowledge bases more rapidly. According to a 2011 Cleveland Clinic study, leveraging big data is estimated to drive $300B in annual value through an 8% systematic cost savings.
5. Processing Granularity
Compute platforms, from smallest to largest:
• Single-core: single-core/single-processor, single-core/multi-processor
• Multi-core: multi-core/single-processor, multi-core/multi-processor
• Cluster: a cluster of processors (single- or multi-core) with shared memory, or a cluster of processors with distributed memory
Broadly, the approach in HPC is to distribute the work across a cluster of machines, which access a shared file system hosted by a SAN.
• Grid of clusters: embarrassingly parallel processing; MapReduce over a distributed file system; cloud computing
Levels of parallelism, from small data sizes to large:
• Pipelined: instruction level
• Concurrent: thread level
• Service: object level
• Indexed: file level
• Mega: block level
• Virtual: system level
Reference: Bina Ramamurthy, 2011
6. How to Process Big Data?
Need to process large datasets (>100 TB)
– Just reading 100 TB of data can be overwhelming
– Takes ~11 days to read on a standard computer
– Takes about a day across a 10 Gbit link (a very high-end storage solution)
– On a single node (at 50 MB/s): ~23 days
– On a 1000-node cluster: ~33 minutes
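The arithmetic behind these figures is easy to check. The script below is a back-of-the-envelope sketch using the transfer rates assumed on this slide (50 MB/s per node, 1000 nodes reading in parallel):

```python
# Back-of-the-envelope read times for a 100 TB dataset.
# The rates are the slide's assumed values, not measurements.

TB = 10**12  # bytes

def read_time_seconds(data_bytes, bytes_per_second):
    """Time to stream data sequentially at a sustained transfer rate."""
    return data_bytes / bytes_per_second

data = 100 * TB

single_node = read_time_seconds(data, 50 * 10**6)          # one node at 50 MB/s
cluster_1000 = read_time_seconds(data, 1000 * 50 * 10**6)  # 1000 nodes in parallel

print(f"single node: {single_node / 86400:.1f} days")         # ~23 days
print(f"1000-node cluster: {cluster_1000 / 60:.1f} minutes")  # ~33 minutes
```

The 1000× speedup is exactly the point: reading is bandwidth-bound, so spreading the data across nodes divides the time linearly.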
7. Examples
• Web logs
• RFID
• sensor networks
• social networks
• social data (due to the social data revolution)
• Internet text and documents
• Internet search indexing
• call detail records
• astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and/or interdisciplinary scientific research
• military surveillance
• medical records
• photography archives
• video archives
• large-scale e-commerce
8. Not so easy…
Moving data from a storage cluster to a computation cluster is not feasible.
In large clusters:
– Failure is expected, rather than exceptional; computers fail every day
– Data is corrupted or lost
– Computations are disrupted
– The number of nodes in a cluster may not be constant
– Nodes can be heterogeneous
It is very expensive to build reliability into each application:
– The programmer has to worry about errors, data motion, communication…
– Traditional debugging and performance tools don't apply
We need a common infrastructure and a standard set of tools to handle this complexity:
– Efficient, scalable, fault-tolerant and easy to use
9. Why are Hadoop and MapReduce needed?
The answer to this question comes from another trend in disk drives:
– seek time is improving more slowly than transfer rate.
Seeking is the process of moving the disk's head to a particular place on the disk to read or write data.
It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth.
If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than to stream through it, which operates at the transfer rate.
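A small worked example makes the seek-vs-streaming trade-off concrete. All of the figures below are illustrative assumptions chosen for the sketch (a 1 TB dataset of 100-byte records, a 10 ms average seek, 100 MB/s sustained transfer), not measurements from any particular drive:

```python
# Illustrative comparison: seek-dominated access vs. streaming.
SEEK_S = 0.01            # assumed 10 ms average seek
TRANSFER_BPS = 100e6     # assumed 100 MB/s sustained transfer rate
DATA_BYTES = 1e12        # 1 TB dataset
RECORD_BYTES = 100       # assumed record size

records = DATA_BYTES / RECORD_BYTES
fraction_updated = 0.01  # touch just 1% of the records

# Seek-dominated: one seek per touched record (transfer time negligible).
seek_time = records * fraction_updated * SEEK_S

# Streaming: read/rewrite the entire dataset sequentially, MapReduce-style.
stream_time = DATA_BYTES / TRANSFER_BPS

print(f"seeking to 1% of records: {seek_time / 3600:.0f} hours")
print(f"streaming the whole 1 TB: {stream_time / 3600:.1f} hours")
```

Under these assumptions, seeking to even 1% of the records takes roughly 100× longer than streaming through the entire dataset, which is why batch systems prefer full sequential scans.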
10. Why are Hadoop and MapReduce needed?
On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well.
For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses sort/merge to rebuild the database.
MapReduce can be seen as a complement to an RDBMS.
MapReduce is a good fit for problems that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis.
12. Hadoop distributions
Apache™ Hadoop™
Apache Hadoop-based Services for Windows Azure
Cloudera's Distribution Including Apache Hadoop (CDH)
Hortonworks Data Platform
IBM InfoSphere BigInsights
Platform Symphony MapReduce
MapR Hadoop Distribution
EMC Greenplum MR (using MapR's M5 Distribution)
Zettaset Data Platform
SGI Hadoop Clusters (uses Cloudera distribution)
Grand Logic JobServer
OceanSync Hadoop Management Software
Oracle Big Data Appliance (uses Cloudera distribution)
13. What's up with the names?
When naming software projects, Doug Cutting seems to have been inspired by his family.
Lucene is his wife's middle name, and her maternal grandmother's first name.
His son, as a toddler, used Nutch as the all-purpose word for meal, and later named a yellow stuffed elephant Hadoop.
Doug said he "was looking for a name that wasn't already a web domain and wasn't trademarked, so I tried various words that were in my life but not used by anybody else. Kids are pretty good at making up words."
14. Hadoop features
Distributed framework for processing and storing data, generally on commodity hardware.
Completely open source.
Written in Java
– Runs on Linux, Mac OS X, Windows, and Solaris.
– Client apps can be written in various languages.
• Scalable: store and process petabytes; scale by adding hardware
• Economical: 1000s of commodity machines
• Efficient: run tasks where the data is located
• Reliable: data is replicated; failed tasks are rerun
• Primarily used for batch data processing, not real-time / user-facing applications
15. Components of Hadoop
• HDFS (Hadoop Distributed File System)
– Modeled on GFS
– Reliable, high-bandwidth file system that can store TBs and PBs of data
• Map-Reduce
– Uses the map/reduce metaphor from the Lisp language
– A distributed processing framework that processes the data stored on HDFS as key-value pairs
[Diagram: Client 1 and Client 2 interact with the DFS and the processing framework; input data in the DFS flows through several Map tasks, then Shuffle & Sort, then Reduce tasks, producing output data: Input → Map → Shuffle & Sort → Reduce → Output.]
16. HDFS
• Very large distributed file system
– 10K nodes, 100 million files, 10 PB
– Linearly scalable
– Supports large files (in GBs or TBs)
• Economical
– Uses commodity hardware
– Nodes fail every day; failure is expected, rather than exceptional
– The number of nodes in a cluster is not constant
• Optimized for batch processing
17. HDFS Goals
• Highly fault-tolerant
– runs on commodity hardware, which can fail frequently
• High throughput of data access
– streaming access to data
• Large files
– typical file is gigabytes to terabytes in size
– support for tens of millions of files
• Simple coherency
– write-once-read-many access model
18. HDFS: Files and Blocks
• Data Organization
– Data is organized into files and directories
– Files are divided into uniform-sized large blocks (typically 128 MB)
– Blocks are distributed across cluster nodes
• Fault Tolerance
– Blocks are replicated (default 3×) to handle hardware failure
– Replication is based on rack awareness, for performance and fault tolerance
– Checksums of the data are kept for corruption detection and recovery
– The client reads both checksum and data from a DataNode; if the checksum fails, it tries other replicas
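As a rough illustration of how the block size and replication factor multiply out, the sketch below computes the block count and replica count for a single file, using the defaults mentioned above (128 MB blocks, replication 3):

```python
import math

# Sketch: how many blocks and replicas a file occupies in HDFS,
# assuming the defaults from the slide (128 MB blocks, 3 replicas).

BLOCK_BYTES = 128 * 1024**2  # 128 MiB
REPLICATION = 3

def block_count(file_bytes, block_bytes=BLOCK_BYTES):
    """Number of HDFS blocks a file occupies (the last block may be partial)."""
    return math.ceil(file_bytes / block_bytes)

file_bytes = 1 * 1024**3  # a 1 GiB file
blocks = block_count(file_bytes)
replicas = blocks * REPLICATION

print(blocks)    # 8 blocks
print(replicas)  # 24 block replicas spread across the cluster
```

So a 1 GiB file consumes 3 GiB of raw cluster storage; the trade-off buys tolerance of disk, node, and (with rack awareness) whole-rack failures.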
19. HDFS: Files and Blocks
• High throughput:
– The client talks to both the NameNode and the DataNodes
– Data is not sent through the NameNode
– Throughput of the file system scales nearly linearly with the number of nodes
• HDFS exposes block placement so that computation can be migrated to the data
20. HDFS Components
• NameNode
– Manages file namespace operations like opening, creating, renaming, etc.
– Maps each file name to its list of blocks + block locations
– Holds file metadata
– Authorization and authentication
– Collects block reports from DataNodes on block locations
– Replicates missing blocks
– Keeps the entire namespace in memory, plus checkpoints and a journal
• DataNode
– Handles block storage on multiple volumes, and data integrity
– Clients access blocks directly from DataNodes for read and write
– DataNodes periodically send block reports to the NameNode
– Block creation, deletion and replication upon instruction from the NameNode
23. MapReduce: Introduction
• Parallel job processing framework
• Written in Java
• Close integration with HDFS
• Provides:
– Automatic partitioning of a job into sub-tasks
– Automatic retry on failures
– Linear scalability
– Locality of task execution
– Plugin-based framework for extensibility
24. Map-Reduce
• MapReduce programs are executed in two main phases, called mapping and reducing.
• In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper.
• In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result.
• The mapper is meant to filter and transform the input into something that the reducer can aggregate over.
• MapReduce uses lists and (key/value) pairs as its main data primitives.
25. Map-Reduce
Map-Reduce program
– Based on two functions: map and reduce
– Every Map/Reduce program must specify a mapper and, optionally, a reducer
– Operates on key and value pairs
Map-Reduce works like a Unix pipeline:
cat input | grep | sort | uniq -c | cat > output
Input | Map | Shuffle & Sort | Reduce | Output
cat /var/log/auth.log* | grep 'session opened' | cut -d' ' -f10 | sort | uniq -c > ~/userlist
Map function: takes a key/value pair and generates a set of intermediate key/value pairs: map(k1, v1) -> list(k2, v2)
Reduce function: takes the intermediate values associated with the same intermediate key: reduce(k2, list(v2)) -> list(k3, v3)
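The two signatures above can be exercised with a minimal in-memory sketch. This is a toy model of the programming model, not the Hadoop API; word count is used as the example, and the helper names (map_fn, reduce_fn, run_mapreduce) are invented for the illustration:

```python
from collections import defaultdict

def map_fn(k1, v1):
    """map(k1, v1) -> list(k2, v2): emit (word, 1) for each word in a line."""
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2, v2_list):
    """reduce(k2, list(v2)) -> list(k3, v3): sum the counts for one word."""
    return [(k2, sum(v2_list))]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: feed every (key, value) input record to the mapper.
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(map_fn(k1, v1))
    # Shuffle & sort: group all intermediate values by their key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce phase: one reducer call per distinct intermediate key.
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output

lines = [(0, "the quick brown fox"), (1, "the lazy dog the end")]
print(run_mapreduce(lines, map_fn, reduce_fn))
# [('brown', 1), ('dog', 1), ('end', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 3)]
```

In real Hadoop, the map and reduce calls run on different machines and the shuffle moves data over the network, but the data flow is exactly this: map, group by key, reduce.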
27. Hadoop and its elements
[Diagram: input files (File 1 … File N) stored in HDFS are divided into splits (Split 1 … Split M); on machines 1 … M, a RecordReader feeds each split to a map task (Map 1 … Map M), whose (key, value) pairs pass through combiners (Combiner 1 … Combiner C) and a partitioner into partitions (Partition 1 … Partition P); reducers (Reducer 1 … Reducer R, running on machine x) consume the partitions as input and write output files (File 1 … File O) back to HDFS.]
28. Hadoop Eco-system
• Hadoop Common: The common utilities that support the other Hadoop subprojects.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
• Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.
• Other Hadoop-related projects at Apache include:
– Avro™: A data serialization system.
– Cassandra™: A scalable multi-master database with no single points of failure.
– Chukwa™: A data collection system for managing large distributed systems.
– HBase™: A scalable, distributed database that supports structured data storage for large tables.
– Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
– Mahout™: A scalable machine learning and data mining library.
– Pig™: A high-level data-flow language and execution framework for parallel computation.
– ZooKeeper™: A high-performance coordination service for distributed applications.
29. Exercise: task
You have time-series data (timestamp, ID, value) collected from 10,000 sensors every millisecond. Your central system stores this data and allows more than 500 people to concurrently access it and execute queries on it. While the last month of data is accessed most frequently, some analytics algorithms build models using historical data as well.
• Task:
– Provide an architecture for such a system that meets the following goals:
– Fast
– Available
– Fair
– Or, provide analytics algorithm and data-structure design considerations (e.g. k-means clustering, or regression) on three months' worth of this data set.
• Group / individual presentation