Gong, Zhihong
Data Warehouse Consultant
September 2012
Big Data
The frontier for innovation
Agenda
• Big Data Overview
• Hadoop Theory and Practice
• MapReduce in Action
• NoSQL
• MPP Database
• What’s hot?
Big Five IT Trends
• Mobile
• Social Media
• Cloud Computing
• Consumerization of IT
• Big Data
Big Data Era
• The coming of the Big Data Era is a chance for
everyone in the technology world to decide into which
camp they fall, as this era will bring the biggest
opportunity for companies and individuals in
technology since the dawn of the Internet.
− Rob Thomas, IBM Vice President, Business Development
Big Data job trend
Big Data – a growing torrent
• 2 billion internet users
• 5 billion mobile phones in use in 2010.
• 30 billion pieces of content shared on Facebook every month.
• 7 TB of data are processed by Twitter every day.
• 10 TB of data are processed by Facebook every day.
• 40% projected growth in global data generated per year.
• 235 TB of data collected by the US Library of Congress as of April 2011.
• 15 out of 17 sectors in the US have more data stored per company
than the US Library of Congress.
• 90% of the data in the world today has been created in the last two
years alone.
Data Rich World
• Data capture and collection
− Sensor data, Mobile device, Social Network, Web clickstream,
Traffic monitoring, Multimedia content, Smart energy meters,
DNA analysis, Industry machines in the age of Internet of
Things, Consumer activities – communicating, browsing,
buying, sharing, searching – create enormous trails of data.
• Data Storage
− Cost of storage is reduced tremendously
− Seagate 3 TB Barracuda @ $149.99 from Amazon.com (4.9¢/GB)
Technology world has changed
• Users: 2,000 users vs. a potential user base of 2 billion
• Applications: Online transaction system vs. Web applications.
• Application architecture: scale-up vs. scale-out
• Infrastructure: a commodity box has more computational power
than a supercomputer a decade ago
• 80% of the world’s information is unstructured.
• Unstructured information is growing at 15 times the rate of
structured information.
• Database architecture has not kept pace
A Sample Case – Big Data
• ShopSavvy5 – mobile shopping App
− 40,000+ retailers
− Millions of shoppers
− Millions of retail store locations
− 240M+ product pictures and user action shots
− 3040M+ product attributes (color, size, features etc.)
− 14,720M+ prices from retailers
− 100+ price requests per second
− delivering real-time inventory and price information
A Sample Case – Big Data (Cont)
• ShopSavvy Architecture
− An entirely new platform, ProductCloud, leverages the
latest Big Data tools like Cassandra, Hadoop, and Mahout, and
maintains HUGE histories of prices, products, scans and
locations that number in the hundreds of billions of items.
− Open architecture layers tools like Mahout on top of the
platform to enable new features like price prediction, user
recommendations, product categorization and product
resolution.
Visualization I
• Retweet network related to Egyptian Revolution
Visualization II
What is “Big Data”
• The term Big Data applies to information that can’t be
processed or analyzed using traditional processes or tools.
• Big Data creates values in several ways
− Creating transparency
− Enabling experimentation to discover needs, expose
variability, and improve performance
− Segmenting populations to customize actions
− Replacing or supporting human decision making with machine
algorithms
− Innovating new business models, products, and services, e.g.
risk estimation
Big Data = Big Value
• $300 billion potential annual value to US health care – more than double
the total annual health care spending in Spain.
• $350 billion potential annual value to Europe’s public sector
administration – more than GDP of Greece.
• $600 billion potential annual consumer surplus from using personal
location data globally.
• 60% potential increase in retailer’s operating margins possible with
big data.
• 140,000 to 190,000 more deep analytic talent positions, and 1.5
million data-savvy managers needed to take full advantage of big
data in the United States.
• Gartner predicts that “Big Data will deliver transformational benefits
to enterprises within 2 to 5 years”.
Characteristics of Big Data
• Volume – Terabytes -> Zettabytes
• Variety – structured, semi-structured, and unstructured data
• Velocity – batch -> streaming data, real-time
Traditional Data vs. Big Data
Traditional Data Warehouse vs. Big Data
• Traditional warehouses
− mostly ideal for analyzing structured data and producing
insights with known and relatively stable measurements.
• Big Data solutions
− ideal for analyzing not only raw structured data, but also semi-
structured and unstructured data from a wide variety of
sources.
− ideal when all of the data needs to be analyzed versus a
sample of the data.
− ideal for iterative and exploratory analysis when business
measures are not predetermined.
CAP Theorem
• CAP
− Consistency
− Availability
− Tolerance to network Partitions
• Consistency models
− Strong consistency
− Weak consistency
− Eventual consistency
• Architectures
− CA: traditional relational database
− AP: NoSQL database
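The weak/eventual consistency models above can be sketched in a few lines (an illustrative toy, with two dicts standing in for replicas and the propagation step written out by hand):

```python
# A hypothetical sketch of eventual consistency: a write lands on one
# replica first and propagates asynchronously, so a read from another
# replica can briefly return stale data before the replicas converge.

replica_a = {"x": 1}
replica_b = {"x": 1}

replica_a["x"] = 2                # write accepted by replica A
stale = replica_b["x"]            # read from B before propagation: stale
replica_b["x"] = replica_a["x"]   # asynchronous propagation catches up

assert stale == 1                 # weak consistency: reads may lag...
assert replica_a == replica_b     # ...but replicas eventually converge
```

Strong consistency would instead block the write (or the read) until both replicas agree, which is exactly the availability cost the CAP theorem describes.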
ACID vs. BASE
• ACID
− Atomicity
− Consistency
− Isolation
− Durability
• BASE
− Basically available
− Soft-state
− Eventual consistency
Lower Priorities
• No complex querying functionality
− No support for SQL
− CRUD operations through database specific API
• No support for joins
− Materialize simple join results in the relevant row
− Give up normalization of data?
• No support for transactions
− Most data stores support single row transactions
− Tunable consistency and availability (e.g., Dynamo)
⇒ Achieve high scalability
Why sacrifice Consistency?
• It is a simple solution
− nobody understands what sacrificing P means
− sacrificing A is unacceptable in the Web
− possible to push the problem to app developer
• C not needed in many applications
− Banks do not implement ACID (classic example wrong)
− Airline reservation only transacts reads (Huh?)
− MySQL et al. ship by default in lower isolation level
• Data is noisy and inconsistent anyway
− making it, say, 1% worse does not matter
Important Design Goals
• Scale out: designed for scale
− Commodity hardware
− Low latency updates
− Sustain high update/insert throughput
• Elasticity – scale up and down with load
• High availability – downtime implies lost revenue
− Replication (with multi-mastering)
− Geographic replication
− Automated failure recovery
A Brief History of Hadoop
• Hadoop is an open source project of the Apache Foundation.
• Hadoop has its origins in Apache Nutch, an open source web search engine, itself
a part of the Lucene project.
• In 2003, Google published a paper that described the architecture of Google’s
distributed filesystem, called GFS.
• In 2004, Google published the paper that introduced MapReduce.
• It is a framework written in Java originally developed by Doug Cutting, the creator
of Apache Lucene, who named it after his son's toy elephant.
• 2004 - Initial versions of what are now the Hadoop Distributed Filesystem and
MapReduce implemented.
• January 2006 — Doug Cutting joins Yahoo!.
• February 2006 —Adoption of Hadoop by Yahoo! Grid team.
• April 2006—Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
A Brief History of Hadoop (Cont)
• January 2007—Research cluster reaches 900 nodes.
• In January 2008, Hadoop was made its own top-level project at
Apache. By this time, Hadoop was being used by many other
companies such as Facebook and the New York Times.
• In February 2008, Yahoo! announced that its production search index
was being generated by a 10,000-node Hadoop cluster.
• In April 2008, Hadoop broke a world record to become the fastest
system to sort a terabyte of data.
• March 2009 — 17 clusters with a total of 24,000 nodes.
• April 2009 — Won the minute sort by sorting 500 GB in 59 seconds
(on 1,400 nodes) and the 100 terabyte sort in 173 minutes (on 3,400
nodes).
Hadoop Ecosystem
• Common - A set of components for distributed filesystems and general I/O
• Avro - A serialization system for efficient data storage.
• MapReduce - A distributed data processing model and execution
environment that runs on large clusters of commodity machines.
• HDFS - A distributed filesystem.
• Pig - A data flow language for exploring very large datasets.
• Hive - A distributed data warehouse system.
• HBase - A distributed, column-oriented database.
• ZooKeeper - A distributed, highly available coordination service.
• Sqoop - A tool for efficiently moving data between relational databases
and HDFS.
Hadoop Distributed File System - HDFS
• A Hadoop filesystem that runs on top of the existing operating system filesystem
• Designed to handle very large files with streaming data access
patterns
• Uses blocks to store a file or parts of a file
− 64 MB (default), 128 MB (recommended) – compared to 4 KB in UNIX
• 1 HDFS block is backed by multiple operating system blocks
• Advantages of blocks
− High throughput
− Fixed size - easy to calculate how many fit on a disk
− A file can be larger than any single disk in the network
− Fits well with replication to provide fault tolerance and availability
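The block arithmetic above is easy to sketch (illustrative Python; 64 MB is the HDFS default cited above, and the helper name is made up for this example):

```python
# Hypothetical sketch: how a large file maps onto fixed-size HDFS blocks.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the HDFS default

def blocks_needed(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of HDFS blocks a file of file_size bytes occupies."""
    return -(-file_size // block_size)  # ceiling division

# A 1 GB file occupies exactly 16 blocks of 64 MB. Unlike a local
# filesystem, a final partial block does not consume a full block on disk.
print(blocks_needed(1024 * 1024 * 1024))  # 16
print(blocks_needed(1))                   # 1
```

Because each block is replicated (3x by default) across DataNodes, losing a disk costs only re-replication of its blocks, not the whole file.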
Hadoop Cluster
Hadoop Architecture
Hadoop Node Type
• HDFS nodes
• NameNode
• One per cluster; manages the filesystem namespace and metadata; has large memory
requirements because it keeps the entire filesystem metadata in memory
• DataNode
• Many per cluster; manages blocks with data and serves them to clients
• MapReduce nodes
• JobTracker
• One per cluster, receives job requests, schedules and monitors MapReduce jobs on
task trackers
• TaskTracker
• Many per cluster, each TaskTracker spawns Java Virtual Machines to run your map or
reduce task.
Write File to HDFS
Run MapReduce jobs
Before MapReduce…
• Large-scale data processing was difficult!
− Managing hundreds or thousands of processors
− Managing parallelization and distribution
− I/O Scheduling
− Status and monitoring
− Fault/crash tolerance
• MapReduce provides all of these, easily!
MapReduce Overview
• What is it?
− Programming model used by Google
− A combination of the Map and Reduce models with an
associated implementation
− Used for processing and generating large data sets
• How does it solve our previously mentioned problems?
− MapReduce is highly scalable and can be used across
many computers.
− Many small machines can be used to process jobs that
normally could not be processed by a large machine.
Simple Data Flow Example
Another Data Flow Example
Map, Shuffle and Reduce
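The map → shuffle → reduce flow can be sketched in a few lines of Python, using the classic word-count example (an in-memory illustration of the model, not the Hadoop API):

```python
from collections import defaultdict

def map_phase(docs):
    # map: emit a (word, 1) pair for every word in every document
    for doc in docs:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    # shuffle: group all intermediate values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the values for each key
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big deal"])))
print(counts)  # {'big': 2, 'data': 1, 'deal': 1}
```

In a real cluster the map calls run in parallel across DataNodes, the shuffle moves data over the network so that all values for one key land on one reducer, and the reduce calls run in parallel per key partition.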
Map Abstraction
• Inputs a key/value pair
– Key is a reference to the input value
– Value is the data set on which to operate
• Evaluation
– Function defined by the user
– Applied to every value in the input
– Might need to parse the input
• Produces a new list of key/value pairs
– Can be different type from input pair
Map Example
Reduce Abstraction
• Starts with intermediate Key / Value pairs
• Ends with finalized Key / Value pairs
• Starting pairs are sorted by key
• Iterator supplies the values for a given key to the Reduce function.
• Typically a function that:
− Starts with a large number of key/value pairs
− One key/value pair for each word in all the files being grepped (including multiple
entries for the same word)
− Ends with very few key/value pairs
− One key/value pair for each unique word across all the files, with the number of
instances summed into that entry
• Broken up so a given worker works with input of the same key.
Reduce Example
MapReduce Data Flow
Why is this approach better?
• Creates an abstraction for dealing with complex
overhead
− The computations are simple, the overhead is messy
• Removing the overhead makes programs much smaller
and thus easier to use
− Less testing is required as well. The MapReduce libraries can
be assumed to work properly, so only user code needs to be
tested
• Division of labor also handled by the MapReduce
libraries, so programmers only need to focus on the
actual computation
MapReduce Advantages
• Automatic Parallelization:
− Depending on the size of the RAW INPUT DATA → instantiate
multiple MAP tasks
− Similarly, depending upon the number of intermediate <key,
value> partitions → instantiate multiple REDUCE tasks
• Run-time:
− Data partitioning
− Task scheduling
− Handling machine failures
− Managing inter-machine communication
• Completely transparent to the programmer/analyst/user
MapReduce: A step backwards?
• Don’t need 1000 nodes to process petabytes:
− Parallel DBs do it in fewer than 100 nodes
• No support for schema:
− Sharing across multiple MR programs difficult
• No indexing:
− Wasteful access to unnecessary data
• Non-declarative programming model:
− Requires highly-skilled programmers
• No support for JOINs:
− Requires multiple MR phases for the analysis
MapReduce vs. Parallel DB
• Web application data is inherently distributed on a large number of
sites:
− Funneling data to DB nodes is a failed strategy
• Distributed and parallel programs difficult to develop:
− Failures and dynamics in the cloud
• Indexing:
− Sequential disk access is 10 times faster than random access.
− Not clear if indexing is the right strategy.
• Complex queries:
− DB community needs to JOIN hands with MR
NoSQL Movement
• Initially used for: “open-source relational databases that did not expose
a SQL interface”
• Popularly used for: “non-relational, distributed data stores that often
did not attempt to provide ACID guarantees”
• Gained widespread popularity through a number of open source
projects
− HBase, Cassandra, MongoDB, Redis, …
• Scale-out, elasticity, flexible data model, high availability
Data in Real World
• There are real data sets that fit neither the relational
model nor modern ACID databases.
• Fit what into where?
− Trees
− Semi-structured data
− Web content
− Multi-dimensional cubes
− Graphs
NoSQL Database Technology
• Not only SQL
− No schema, more dynamic data model
− Denormalizing, no join
− CAP theorem
− Auto-sharding (elasticity)
− Distributed query support
− Integrated caching
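The auto-sharding bullet above can be sketched with simple hash-based placement (node names and the `owner` helper are illustrative; real systems use consistent hashing so that adding a node moves only about 1/N of the keys instead of reshuffling most of them):

```python
import hashlib

# Hypothetical sketch of hash-based sharding: the key alone determines
# the owning node, so any client can route a request without consulting
# a central directory.
NODES = ["node-a", "node-b", "node-c"]  # illustrative cluster

def owner(key: str, nodes=NODES) -> str:
    # md5 gives a placement that is stable across processes,
    # unlike Python's salted built-in hash().
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

assert owner("user:42") == owner("user:42")  # deterministic routing
assert owner("user:42") in NODES
```

This is also why joins are dropped: two keys usually live on different nodes, so the data model denormalizes instead (materializing related data under one key).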
NoSQL Databases
• Key-Value store
− Redis (in memory), Riak
• Column oriented
− Cassandra, HBase, Dynamo, BigTable
• Document oriented
− MongoDB (JSON), CouchBase
• Graph
Key Value Stores
• Key-Valued data model
− Key is the unique identifier
− Key is the granularity for consistent access
− Value can be structured or unstructured
• Gained widespread popularity
− In house: Bigtable (Google), PNUTS (Yahoo!), Dynamo
(Amazon)
− Open source: HBase, Hypertable, Cassandra, Voldemort
• Popular choice for the modern breed of web-applications
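The key-value model above can be sketched as a minimal in-memory store (a hypothetical API, for illustration only): the key is the unit of access, and the value, structured or unstructured, is opaque to the store. Note how a materialized "join" (a user row carrying its own order list) substitutes for read-time joins:

```python
class KeyValueStore:
    """Toy in-memory key-value store: key = unique identifier and
    the granularity of consistent access; value = opaque blob."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Writes replace the whole value at key granularity.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
# Structured value with a materialized "join": the user's orders live
# inside the row, so one key lookup answers the query without a join.
store.put("user:carol", {"name": "carol", "orders": ["book", "pen"]})
store.put("img:001", b"\x89PNG...")  # unstructured value

print(store.get("user:carol")["orders"])  # ['book', 'pen']
```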
Cassandra – A NoSQL Database
• An open source, distributed store for structured data
that scales-out on cheap, commodity hardware
• Simplicity of Operations
• Transparency
• Very High Availability
• Painless Scale-Out
• Solid, Predictable Performance on Commodity and
Cloud Servers
Column Oriented
Column Oriented – Data Structure
• Tuples: { "row key": { "column name": ("value", "timestamp") } }
• insert("carol", { "car": ("daewoo", "2011/11/15 15:00") })

Row key → columns (value @ timestamp):
− jim: age: 36 @ 2011/01/01 12:35; car: camaro @ 2011/01/01 12:35; gender: M @ 2011/01/01 12:35
− carol: age: 37 @ 2011/01/01 12:35; car: subaru @ 2011/01/01 12:35; gender: F @ 2011/01/01 12:35
− johnny: age: 12 @ 2011/01/01 12:35; gender: M @ 2011/01/01 12:35
− suzy: age: 10 @ 2011/01/01 12:35; gender: F @ 2011/01/01 12:35
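The layout above can be sketched as nested dictionaries (the `insert` helper and timestamp format are illustrative, mirroring the example rows; note that rows need not share the same columns, and an update overwrites one column with a newer timestamp):

```python
import time

# Hypothetical sketch of the column-oriented tuple layout:
# row key -> {column name: (value, timestamp)}
table = {}

def insert(row_key, columns, ts=None):
    ts = ts or time.strftime("%Y/%m/%d %H:%M")
    row = table.setdefault(row_key, {})
    for name, value in columns.items():
        row[name] = (value, ts)  # newest write wins, per column

insert("jim", {"age": 36, "car": "camaro", "gender": "M"}, "2011/01/01 12:35")
insert("carol", {"age": 37, "car": "subaru", "gender": "F"}, "2011/01/01 12:35")
insert("carol", {"car": "daewoo"}, "2011/11/15 15:00")  # update one column

print(table["carol"]["car"])  # ('daewoo', '2011/11/15 15:00')
```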
Massively Parallel Processing (MPP) DB
• Vertica (HP)
• Greenplum (EMC)
• Netezza (IBM)
• Teradata (NCR)
• Kognitio
− In-memory analytics
− No need for data partition or indexing
− Scans data in excess of 650 million rows per second per server. Linear
scalability means 100 nodes can scan over 650 billion rows per
second!
Vertica
• Supports logical relational models, SQL, ACID transactions, JDBC
• Columnar Store Architecture
− 50x–1000x faster by eliminating costly disk I/O
− offers aggressive data compression to reduce storage costs by up to 90%
• 20x – 100x faster than traditional RDBMS data warehouse, runs on commodity
hardware
• Scale-out MPP Architecture
• Real-time loading and querying
• In-Database Analytics
• Automatic high availability
• Natively supports grid computing
• Natively supports MapReduce and Hadoop
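The compression claim is easiest to see with a sketch: values within one column sit together on disk and are often repetitive, so even simple run-length encoding collapses them (figures are illustrative; Vertica's actual encodings are more sophisticated):

```python
from itertools import groupby

def rle(column):
    # Run-length encode a column as (value, run length) pairs.
    return [(v, len(list(g))) for v, g in groupby(column)]

# A sorted "gender" column of a million rows compresses to two pairs;
# stored row by row, the same values would be interleaved with other
# fields and far harder to compress.
column = ["F"] * 600_000 + ["M"] * 400_000
print(rle(column))  # [('F', 600000), ('M', 400000)]
```

Column storage also means a query touching 3 of 50 columns reads only those 3 from disk, which is where the elimination of disk I/O comes from.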
Machine Learning
• Machine learning systems automate decision making on
data, automatically producing outputs like product
recommendations or groupings.
• WEKA - a Java-based framework and GUI for machine
learning algorithms.
• Mahout - an open source framework that can run
common machine learning algorithms on massive
datasets.
Popular Technologies
• Databases
− HBase, Cassandra, MongoDB, Redis, CouchDB, Vertica, Greenplum, Netezza
• Programming Languages
− Java, Python, Perl; Hive, Pig, JAQL
• ETL tools
− Talend, Pentaho
• BI tools
− Pentaho, Tableau
• Analytics
− R, Mahout, BigInsight
• Methodology
− Agile
• Other
− Hadoop, MapReduce, Lucene, Solr, JSON, UIMA, ZooKeeper
Questions
References
• Big data: The next frontier for innovation, competition and
productivity, McKinsey Global Institute, May 2011
• Understanding Big Data, IBM, 2012
• NoSQL Database Technology Whitepaper, CouchBase
• Big Data and Cloud Computing: Current State and Future
Opportunities, 2011
• Hadoop Definitive Guide
• How Do I Cassandra, Nov 2011
• BigDataUniversity.com
• youtube.com/ibmetinfo
• ……
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

Big data

Technology world has changed
• Users: 2,000 users vs. a potential user base of 2 billion
• Applications: online transaction systems vs. web applications
• Application architecture: centralized vs. scale-up
• Infrastructure: a commodity box has more computational power than a supercomputer of a decade ago
• 80% of the world's information is unstructured.
• Unstructured information is growing at 15 times the rate of structured information.
• Database architecture has not kept pace.
A Sample Case – Big Data
• ShopSavvy5 – a mobile shopping app
− 40,000+ retailers
− Millions of shoppers
− Millions of retail store locations
− 240M+ product pictures and user action shots
− 3,040M+ product attributes (color, size, features, etc.)
− 14,720M+ prices from retailers
− 100+ price requests per second
− Delivers real-time inventory and price information
A Sample Case – Big Data (Cont.)
• ShopSavvy Architecture
− An entirely new platform, ProductCloud, leverages the latest Big Data tools such as Cassandra, Hadoop, and Mahout, and maintains huge histories of prices, products, scans, and locations that number in the hundreds of billions of items.
− An open architecture layers tools like Mahout on top of the platform to enable new features such as price prediction, user recommendations, product categorization, and product resolution.
Visualization I
• Retweet network related to the Egyptian Revolution
What is "Big Data"?
• The term Big Data applies to information that can't be processed or analyzed using traditional processes or tools.
• Big Data creates value in several ways:
− Creating transparency
− Enabling experimentation to discover needs, expose variability, and improve performance
− Segmenting populations to customize actions
− Replacing/supporting human decision making with machine algorithms
− Innovating new business models, products, and services, e.g. risk estimation
Big Data = Big Value
• $300 billion potential annual value to US health care – more than double the total annual health care spending in Spain.
• $350 billion potential annual value to Europe's public sector administration – more than the GDP of Greece.
• $600 billion potential annual consumer surplus from using personal location data globally.
• 60% potential increase in retailers' operating margins possible with big data.
• 140,000 to 190,000 more deep analytical talent positions and 1.5 million data-savvy managers needed to take full advantage of big data in the United States.
• Gartner predicts that "Big Data will deliver transformational benefits to enterprises within 2 to 5 years."
Characteristics of Big Data
• Volume – terabytes → zettabytes
• Variety – structured, semi-structured, and unstructured data
• Velocity – batch → streaming data, real-time
Traditional Data Warehouse vs. Big Data
• Traditional warehouses
− Mostly ideal for analyzing structured data and producing insights with known and relatively stable measurements.
• Big Data solutions
− Ideal for analyzing not only raw structured data, but also semi-structured and unstructured data from a wide variety of sources.
− Ideal when all of the data needs to be analyzed, versus a sample of the data.
− Ideal for iterative and exploratory analysis when business measures are not predetermined.
CAP Theorem
• CAP
− Consistency
− Availability
− Tolerance to network Partitions
• Consistency models
− Strong consistency
− Weak consistency
− Eventual consistency
• Architectures
− CA: traditional relational databases
− AP: NoSQL databases
ACID vs. BASE
• ACID
− Atomicity
− Consistency
− Isolation
− Durability
• BASE
− Basically available
− Soft state
− Eventual consistency
Lower Priorities
• No complex querying functionality
− No support for SQL
− CRUD operations through database-specific APIs
• No support for joins
− Materialize simple join results in the relevant row
− Give up normalization of data?
• No support for transactions
− Most data stores support single-row transactions
− Tunable consistency and availability (e.g., Dynamo)
→ Achieve high scalability
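The "tunable consistency" bullet above can be made concrete with the quorum arithmetic used in Dynamo-style stores: with N replicas, a read of R replicas and a write of W replicas must overlap, and therefore observe the latest value, whenever R + W > N. A minimal sketch (the function name is ours, not from any particular store):

```python
# Quorum overlap rule for Dynamo-style replication: a read quorum of size R
# and a write quorum of size W over N replicas always intersect iff R + W > N,
# so the read is guaranteed to see the most recent write.

def read_sees_latest_write(n, r, w):
    """True when every read quorum must intersect every write quorum."""
    return r + w > n

print(read_sees_latest_write(3, 2, 2))  # True  — strongly consistent setting
print(read_sees_latest_write(3, 1, 1))  # False — only eventually consistent
```

Lowering R or W trades the consistency guarantee away for lower latency and higher availability, which is exactly the knob these stores expose.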
Why sacrifice Consistency?
• It is a simple solution
− Nobody understands what sacrificing P means
− Sacrificing A is unacceptable on the Web
− Possible to push the problem to the app developer
• C is not needed in many applications
− Banks do not implement ACID (the classic example is wrong)
− Airline reservations only transact reads (huh?)
− MySQL et al. ship by default at a lower isolation level
• Data is noisy and inconsistent anyway
− Making it, say, 1% worse does not matter
Important Design Goals
• Scale out: designed for scale
− Commodity hardware
− Low-latency updates
− Sustain high update/insert throughput
• Elasticity – scale up and down with load
• High availability – downtime implies lost revenue
− Replication (with multi-mastering)
− Geographic replication
− Automated failure recovery
A Brief History of Hadoop
• Hadoop is an open source project of the Apache Foundation.
• Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.
• In 2003, Google published a paper that described the architecture of Google's distributed filesystem, called GFS.
• In 2004, Google published the paper that introduced MapReduce.
• It is a framework written in Java, originally developed by Doug Cutting, the creator of Apache Lucene, who named it after his son's toy elephant.
• 2004 – Initial versions of what are now the Hadoop Distributed Filesystem and MapReduce implemented.
• January 2006 – Doug Cutting joins Yahoo!.
• February 2006 – Adoption of Hadoop by the Yahoo! Grid team.
• April 2006 – Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
A Brief History of Hadoop (Cont.)
• January 2007 – Research cluster reaches 900 nodes.
• In January 2008, Hadoop was made its own top-level project at Apache. By this time, Hadoop was being used by many other companies, such as Facebook and the New York Times.
• In February 2008, Yahoo! announced that its production search index was being generated by a 10,000-node Hadoop cluster.
• In April 2008, Hadoop broke a world record to become the fastest system to sort a terabyte of data.
• March 2009 – 17 clusters with a total of 24,000 nodes.
• April 2009 – Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100-terabyte sort in 173 minutes (on 3,400 nodes).
Hadoop Ecosystem
• Common – A set of components for distributed filesystems and general I/O
• Avro – A serialization system for efficient data storage
• MapReduce – A distributed data processing model and execution environment that runs on large clusters of commodity machines
• HDFS – A distributed filesystem
• Pig – A data flow language for exploring very large datasets
• Hive – A distributed data warehouse system
• HBase – A distributed, column-oriented database
• ZooKeeper – A distributed, highly available coordination service
• Sqoop – A tool for efficiently moving data between relational databases and HDFS
Hadoop Distributed File System – HDFS
• A Hadoop filesystem that runs on top of the existing file system
• Designed to handle very large files with streaming data access patterns
• Uses blocks to store a file or parts of a file
− 64 MB (default), 128 MB (recommended) – compare to 4 KB in UNIX
• 1 HDFS block is supported by multiple operating-system blocks
• Advantages of blocks
− Big throughput
− Fixed size – easy to calculate how many fit on a disk
− A file can be larger than any single disk in the network
− Fits well with replication to provide fault tolerance and availability
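The block sizing above lends itself to quick capacity math. A small sketch (helper names and the 1 TB example are illustrative; 3-way replication is the HDFS default):

```python
# Back-of-envelope HDFS block arithmetic: files are split into fixed-size
# blocks (64 MB default, 128 MB recommended), and each block is replicated
# (factor 3 by default) across DataNodes.
BLOCK_SIZE_MB = 64
REPLICATION = 3

def blocks_needed(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of HDFS blocks a file occupies (the last block may be partial)."""
    return -(-file_size_mb // block_size_mb)  # ceiling division

def raw_storage_mb(file_size_mb, replication=REPLICATION):
    """Cluster storage consumed once every block is replicated.
    A partial last block only consumes its actual size, so this is just
    file size times the replication factor."""
    return file_size_mb * replication

# A 1 TB (1,048,576 MB) file:
print(blocks_needed(1_048_576))   # 16384 blocks at 64 MB each
print(raw_storage_mb(1_048_576))  # 3145728 MB with 3-way replication
```

The block count also hints at the NameNode's memory pressure described on the next slide: one metadata entry per block, held entirely in RAM.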
Hadoop Node Types
• HDFS nodes
− NameNode: one per cluster; manages the filesystem namespace and metadata; has large memory requirements, as it keeps the entire filesystem metadata in memory
− DataNode: many per cluster; manages blocks of data and serves them to clients
• MapReduce nodes
− JobTracker: one per cluster; receives job requests, schedules and monitors MapReduce jobs on TaskTrackers
− TaskTracker: many per cluster; each TaskTracker spawns Java Virtual Machines to run your map or reduce tasks
Before MapReduce…
• Large-scale data processing was difficult!
− Managing hundreds or thousands of processors
− Managing parallelization and distribution
− I/O scheduling
− Status and monitoring
− Fault/crash tolerance
• MapReduce provides all of these, easily!
MapReduce Overview
• What is it?
− A programming model used by Google
− A combination of the Map and Reduce models with an associated implementation
− Used for processing and generating large data sets
• How does it solve our previously mentioned problems?
− MapReduce is highly scalable and can be used across many computers.
− Many small machines can be used to process jobs that normally could not be processed by a large machine.
Simple Data Flow Example
Another Data Flow Example
Map Abstraction
• Input is a key/value pair
− Key is a reference to the input value
− Value is the data set on which to operate
• Evaluation
− Function defined by the user
− Applied to every value in the input
− Might need to parse the input
• Produces a new list of key/value pairs
− Can be of a different type from the input pair
Reduce Abstraction
• Starts with intermediate key/value pairs
• Ends with finalized key/value pairs
• Starting pairs are sorted by key
• An iterator supplies the values for a given key to the Reduce function
• Typically a function that:
− Starts with a large number of key/value pairs – one key/value for each word in all files being grepped (including multiple entries for the same word)
− Ends with very few key/value pairs – one key/value for each unique word across all the files, with the number of instances summed into this entry
• Broken up so a given worker works with input of the same key
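The map and reduce abstractions above are easiest to see in the canonical word-count example. Below is a single-process Python simulation of the data flow — map emits (word, 1) pairs, the framework sorts and groups by key, and reduce sums each group. It is a sketch of the model, not the Hadoop API:

```python
from collections import defaultdict

def map_phase(documents):
    """Emit one (word, 1) pair per word, mirroring a word-count Mapper."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group intermediate values by key and sort by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    """Sum the value iterator for each key, mirroring a word-count Reducer."""
    return {key: sum(values) for key, values in grouped}

docs = ["big data big value", "big opportunity"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 3, 'data': 1, 'opportunity': 1, 'value': 1}
```

In a real cluster the three stages run on different machines: mappers on the nodes holding the input blocks, the shuffle over the network, and reducers partitioned by key.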
Why is this approach better?
• Creates an abstraction for dealing with complex overhead
− The computations are simple; the overhead is messy
• Removing the overhead makes programs much smaller and thus easier to use
− Less testing is required as well. The MapReduce libraries can be assumed to work properly, so only user code needs to be tested.
• Division of labor is also handled by the MapReduce libraries, so programmers only need to focus on the actual computation
MapReduce Advantages
• Automatic parallelization:
− Depending on the size of the raw input data → instantiate multiple Map tasks
− Similarly, depending upon the number of intermediate <key, value> partitions → instantiate multiple Reduce tasks
• Run-time:
− Data partitioning
− Task scheduling
− Handling machine failures
− Managing inter-machine communication
• Completely transparent to the programmer/analyst/user
MapReduce: A step backwards?
• Don't need 1,000 nodes to process petabytes:
− Parallel DBs do it in fewer than 100 nodes
• No support for schemas:
− Sharing across multiple MR programs is difficult
• No indexing:
− Wasteful access to unnecessary data
• Non-declarative programming model:
− Requires highly skilled programmers
• No support for JOINs:
− Requires multiple MR phases for the analysis
MapReduce vs. Parallel DB
• Web application data is inherently distributed on a large number of sites:
− Funneling data to DB nodes is a failed strategy
• Distributed and parallel programs are difficult to develop:
− Failures and dynamics in the cloud
• Indexing:
− Sequential disk access is 10 times faster than random access
− Not clear if indexing is the right strategy
• Complex queries:
− The DB community needs to JOIN hands with MR
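The JOIN limitation called out above is usually worked around with the "reduce-side join" pattern: tag each record with its source table, group on the join key, and pair records up in the reducer. A single-process sketch of the pattern (table and field names here are illustrative, not from the deck):

```python
from collections import defaultdict

# Reduce-side join: MapReduce has no native JOIN, so the map phase tags each
# record with its source ("U" for users, "O" for orders) and keys it on the
# join attribute; the reduce phase groups by key and emits the cross product
# of user and order records within each group.

users  = [(1, "alice"), (2, "bob")]               # (user_id, name)
orders = [(1, "book"), (1, "lamp"), (2, "pen")]   # (user_id, item)

def map_join(users, orders):
    for uid, name in users:
        yield (uid, ("U", name))
    for uid, item in orders:
        yield (uid, ("O", item))

def reduce_join(pairs):
    grouped = defaultdict(list)
    for key, tagged in pairs:
        grouped[key].append(tagged)
    joined = []
    for uid, records in sorted(grouped.items()):
        names = [v for tag, v in records if tag == "U"]
        items = [v for tag, v in records if tag == "O"]
        for name in names:           # cross product within the key group
            for item in items:
                joined.append((uid, name, item))
    return joined

print(reduce_join(map_join(users, orders)))
# [(1, 'alice', 'book'), (1, 'alice', 'lamp'), (2, 'bob', 'pen')]
```

This is exactly the "extra MR phase" cost the slide complains about: what a parallel database does with an indexed join here requires a full shuffle of both inputs.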
NoSQL Movement
• Initially used for: "open-source relational databases that did not expose a SQL interface"
• Popularly used for: "non-relational, distributed data stores that often did not attempt to provide ACID guarantees"
• Gained widespread popularity through a number of open source projects
− HBase, Cassandra, MongoDB, Redis, …
• Scale-out, elasticity, flexible data model, high availability
Data in the Real World
• There are real data sets that don't fit the relational model, nor modern ACID databases.
• Fit what into where?
− Trees
− Semi-structured data
− Web content
− Multi-dimensional cubes
− Graphs
NoSQL Database Technology
• Not only SQL
− No schema; a more dynamic data model
− Denormalizing; no joins
− CAP theorem
− Auto-sharding (elasticity)
− Distributed query support
− Integrated caching
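The "denormalizing, no joins" point can be illustrated with a document that embeds its related rows, so a single read returns the whole aggregate. A sketch (field names are illustrative, not from any particular store):

```python
import json

# Denormalization in a document store: instead of separate, normalized
# customer and order tables joined at query time, the orders are embedded
# directly inside the customer document.

customer_doc = {
    "_id": "cust-42",
    "name": "Carol",
    "orders": [  # embedded: no join needed to read a customer with orders
        {"item": "subaru", "price": 20000},
        {"item": "tires", "price": 400},
    ],
}

stored = json.dumps(customer_doc)   # what the store would persist
loaded = json.loads(stored)         # one read returns the whole aggregate
print(len(loaded["orders"]))  # 2
```

The trade-off is the one the slide hints at: updates that touch shared data (say, a product price embedded in many documents) must now be applied in many places.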
NoSQL Databases
• Key-value stores
− Redis (in-memory), Riak
• Column-oriented
− Cassandra, HBase, Dynamo, BigTable
• Document-oriented
− MongoDB (JSON), Couchbase
• Graph
Key-Value Stores
• Key-value data model
− Key is the unique identifier
− Key is the granularity for consistent access
− Value can be structured or unstructured
• Gained widespread popularity
− In house: Bigtable (Google), PNUTS (Yahoo!), Dynamo (Amazon)
− Open source: HBase, Hypertable, Cassandra, Voldemort
• A popular choice for the modern breed of web applications
Cassandra – A NoSQL Database
• An open source, distributed store for structured data that scales out on cheap, commodity hardware
• Simplicity of operations
• Transparency
• Very high availability
• Painless scale-out
• Solid, predictable performance on commodity and cloud servers
Column Oriented – Data Structure
• Tuples: { "row key": { "column name": ("value", "timestamp") } }
• insert("carol", { "car": "daewoo", 2011/11/15 15:00 })

Row Key | Columns
jim     | age: 36 (2011/01/01 12:35) | car: camaro (2011/01/01 12:35) | gender: M (2011/01/01 12:35)
carol   | age: 37 (2011/01/01 12:35) | car: subaru (2011/01/01 12:35) | gender: F (2011/01/01 12:35)
johnny  | age: 12 (2011/01/01 12:35) | gender: M (2011/01/01 12:35)
suzy    | age: 10 (2011/01/01 12:35) | gender: F (2011/01/01 12:35)
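The structure in the table above can be modeled as nested maps where each row key holds a set of named columns and every column carries its own timestamp; an update simply writes a column with a newer timestamp and the latest write wins. A toy sketch of the model, not Cassandra's actual API:

```python
# Toy column-family model: row key -> {column name -> (value, timestamp)}.
# Rows may have different columns (johnny and suzy have no "car"), and a
# write only replaces a column when its timestamp is newer (last write wins).

store = {
    "jim":   {"age": ("36", "2011/01/01 12:35"), "car": ("camaro", "2011/01/01 12:35")},
    "carol": {"age": ("37", "2011/01/01 12:35"), "car": ("subaru", "2011/01/01 12:35")},
}

def insert(row_key, name, value, timestamp):
    columns = store.setdefault(row_key, {})
    current = columns.get(name)
    # Timestamps share a zero-padded format, so string comparison orders them.
    if current is None or timestamp > current[1]:
        columns[name] = (value, timestamp)

# The insert from the slide: carol's car column gets a newer value.
insert("carol", "car", "daewoo", "2011/11/15 15:00")
print(store["carol"]["car"])  # ('daewoo', '2011/11/15 15:00')
```

Per-column timestamps are what let replicas reconcile concurrent writes without coordinating up front.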
Massively Parallel Processing (MPP) Databases
• Vertica (HP)
• Greenplum (EMC)
• Netezza (IBM)
• Teradata (NCR)
• Kognitio
− In-memory analytics
− No need for data partitioning or indexing
− Scans data in excess of 650 million rows per second per server; linear scalability means 100 nodes can scan over 65 billion rows per second!
Vertica
• Supports logical relational models, SQL, ACID transactions, JDBC
• Columnar store architecture
− 50x–1000x faster by eliminating costly disk I/O
− Offers aggressive data compression to reduce storage costs by up to 90%
• 20x–100x faster than a traditional RDBMS data warehouse; runs on commodity hardware
• Scale-out MPP architecture
• Real-time loading and querying
• In-database analytics
• Automatic high availability
• Natively supports grid computing
• Natively supports MapReduce and Hadoop
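One reason the columnar architecture above compresses so aggressively: a column's values are stored contiguously, and low-cardinality columns collapse under even the simplest run-length encoding. A sketch with illustrative data (not Vertica's actual encoding scheme):

```python
from itertools import groupby

# In a column store, one column's values sit together on disk. A sorted,
# low-cardinality column (like gender) then collapses under run-length
# encoding: 12 stored values become 2 (value, run_length) pairs.

gender_column = ["F"] * 5 + ["M"] * 7  # one column, stored contiguously

def rle(values):
    """Run-length encode a sequence into [(value, run_length), ...]."""
    return [(v, len(list(run))) for v, run in groupby(values)]

encoded = rle(gender_column)
print(encoded)  # [('F', 5), ('M', 7)]
```

Compression also cuts the disk I/O the slide mentions: fewer bytes read per query, and queries that touch only a few columns skip the rest entirely.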
Machine Learning
• Machine learning systems automate decision making on data, automatically producing outputs like product recommendations or groupings.
• WEKA – a Java-based framework and GUI for machine learning algorithms
• Mahout – an open source framework that can run common machine learning algorithms on massive datasets
Popular Technologies
• Databases
− HBase, Cassandra, MongoDB, Redis, CouchDB, Vertica, Greenplum, Netezza
• Programming languages
− Java; Python, Perl; Hive, Pig, JAQL
• ETL tools
− Talend, Pentaho
• BI tools
− Pentaho, Tableau
• Analytics
− R, Mahout, BigInsights
• Methodology
− Agile
• Other
− Hadoop, MapReduce, Lucene, Solr, JSON, UIMA, ZooKeeper
References
• Big Data: The Next Frontier for Innovation, Competition and Productivity, McKinsey Global Institute, May 2011
• Understanding Big Data, IBM, 2012
• NoSQL Database Technology whitepaper, Couchbase
• Big Data and Cloud Computing: Current State and Future Opportunities, 2011
• Hadoop: The Definitive Guide
• How Do I Cassandra?, Nov 2011
• BigDataUniversity.com
• youtube.com/ibmetinfo
• …