Introduction to Big Data and NoSQL.
This presentation was given to the Master DBA course at John Bryce Education in Israel.
Work is based on presentations by Michael Naumov, Baruch Osoveskiy, Bill Graham and Ronen Fidel.
2. Agenda
• What is Big Data and the 3 Vs
• Introduction to Hadoop
• Who Handles Big Data and Data Science
• NoSQL
3. Who am I?
• Zohar Elkayam, CTO at Brillix
• Oracle ACE Associate
• DBA, team leader, instructor and senior consultant for over 16 years
• Editor (and manager) of ilDBA – Israel Database Community
• Blogger – www.realdbamagic.com
7. So, What is Big Data?
• When the data is too big or moves too fast to handle in a
sensible amount of time.
• When the data doesn’t fit conventional database structure.
• When the solution becomes part of the problem.
8. Big Problems with Big Data
• Unstructured
• Unprocessed
• Un-aggregated
• Un-filtered
• Repetitive
• Low quality
• And generally messy
• Oh, and there is a lot of it
10. Sample of Big Data Use Cases Today
• Media / Entertainment – viewers / advertising effectiveness
• Communications – location-based advertising
• Education & Research – experiment sensor analysis
• Consumer Packaged Goods – sentiment analysis of what's hot, problems
• Health Care – patient sensors, monitoring, EHRs, quality of care
• Life Sciences – clinical trials, genomics
• High Technology / Industrial Mfg. – manufacturing quality, warranty analysis
• Oil & Gas – drilling exploration sensor analysis
• Financial Services – risk & portfolio analysis, new products
• Automotive – auto sensors reporting location, problems
• Retail – consumer sentiment, optimized marketing
• Law Enforcement & Defense – threat analysis: social media monitoring, photo analysis
• Travel & Transportation – sensor analysis for optimal traffic flows, customer sentiment
• Utilities – smart meter analysis for network capacity
• On-line Services / Social Media – people & career matching, web-site optimization
11. Most Requested Uses of Big Data
• Log Analytics & Storage
• Smart Grid / Smarter Utilities
• RFID Tracking & Analytics
• Fraud / Risk Management & Modeling
• 360° View of the Customer
• Warehouse Extension
• Email / Call Center Transcript Analysis
• Call Detail Record Analysis
14. Big Data: Challenge to Value
• Challenges today: high variety, high volume, high velocity
• Business value tomorrow: deep analytics, high agility, massive scalability, real time
15. Volume
• Big Data comes in one size: big. Size is measured in terabytes, petabytes, and even exabytes and zettabytes.
• The storing and handling of the data becomes an issue.
• Producing value out of the data in a reasonable time is also
an issue.
16. Velocity
• The speed in which the data is being generated and collected.
• Streaming data and large-volume data movement.
• High velocity of data capture – requires rapid ingestion.
• What happens on downtime (the backlog problem).
17. Variety
• Big Data extends beyond structured data to include semi-structured and unstructured information: logs, text, audio and video.
• Wide variety of rapidly evolving data types requires highly
flexible stores and handling.
18. Big Data is ANY data
Unstructured, Semi-Structure and Structured
• Some has fixed structure
• Some is “bring your own structure”
• We want to find value in all of it
22. Big Data in Practice
• Big data is big: technological infrastructure solutions needed.
• Big data is messy: data sources must be cleaned before use.
• Big data is complicated: need developers and system admins
to manage intake of data.
23. Big Data in Practice (cont.)
• Data must be broken out of silos in order to be mined,
analyzed and transformed into value.
• The organization must learn how to communicate and
interpret the results of analysis.
25. Infrastructure Challenges – Cont.
• Storage:
• Efficient and cost-effective enough to capture and store terabytes, if
not petabytes, of data
• With intelligent capabilities to reduce your data footprint such as:
• Data compression
• Automatic data tiering
• Data deduplication
26. Infrastructure Challenges – Cont.
• Network infrastructure that can quickly import large data sets
and then replicate it to various nodes for processing
• Security capabilities that protect highly-distributed
infrastructure and data
28. Apache Hadoop
• Open source project run by Apache (2006).
• Hadoop brings the ability to cheaply process large amounts
of data, regardless of its structure.
• Apache Hadoop has been the driving force behind the growth
of the Big Data industry.
30. Key points
• An open-source framework that uses a simple programming model
to enable distributed processing of large data sets on clusters of
computers.
• The complete technology stack includes
• common utilities
• a distributed file system
• analytics and data storage platforms
• an application layer that manages distributed processing, parallel
computation, workflow, and configuration management
• More cost-effective for handling large unstructured data sets than
conventional approaches, and it offers massive scalability and
speed
31. Why use Hadoop?
• Cost: leverages commodity hardware and open source software
• Scalability: near-linear performance up to thousands of nodes
• Flexibility: versatility with data, analytics and operation
32. Really, Why use Hadoop?
• Need to process Multi Petabyte Datasets
• Expensive to build reliability in each application.
• Nodes fail every day
• Failure is expected, rather than exceptional.
• The number of nodes in a cluster is not constant.
• Need common infrastructure
• Efficient, reliable, Open Source Apache License
• The above goals are the same as Condor's, but
• Workloads are IO bound and not CPU bound
33. Hadoop Benefits
• Reliable solution based on unreliable hardware
• Designed for large files
• Load data first, structure later
• Designed to maximize throughput of large scans
• Designed to leverage parallelism
• Designed to scale
• Flexible development platform
• Solution Ecosystem
34. Hadoop Limitations
• Hadoop is scalable but not fast
• Some assembly required
• Batteries not included
• Instrumentation not included either
• DIY mindset (remember Linux/MySQL?)
• On the larger scale – Hadoop is not cheap (but still cheaper
than using old solutions)
35. Example Comparison: RDBMS vs. Hadoop
• Data size: gigabytes (typical traditional RDBMS) vs. petabytes (Hadoop)
• Access: interactive and batch vs. batch only (not interactive)
• Updates: read/write many times vs. write once, read many times
• Structure: static schema vs. dynamic schema
• Scaling: nonlinear vs. linear
• Query response time: can be near immediate vs. has latency (due to batch processing)
36. Hadoop and Relational Database
Best when used together:
• Relational database best used for: interactive OLAP analytics (<1 sec), multistep transactions, 100% SQL compliance
• Hadoop best used for: structured or not (flexibility), scalability of storage/compute, complex data processing, cheaper compared to RDBMS
38. Hadoop Main Components
• HDFS: Hadoop Distributed File System – distributed file
system that runs in a clustered environment.
• MapReduce – programming paradigm for running processes
over clustered environments.
39. HDFS is...
• A distributed file system
• Redundant storage
• Designed to reliably store data using commodity hardware
• Designed to expect hardware failures
• Intended for large files
• Designed for batch inserts
• The Hadoop Distributed File System
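• Example (a minimal sketch, not from the original slides: it assumes a working Hadoop installation with the hdfs command on the PATH, and simply wraps a few hdfs dfs calls from Python; the paths and file names are illustrative):
import subprocess

def hdfs(*args):
    # Run an "hdfs dfs" sub-command and fail loudly if it returns an error
    subprocess.run(["hdfs", "dfs"] + list(args), check=True)

hdfs("-mkdir", "-p", "/user/demo/logs")         # create a directory in HDFS
hdfs("-put", "access.log", "/user/demo/logs/")  # upload a large local file
hdfs("-ls", "/user/demo/logs")                  # list the directory
hdfs("-cat", "/user/demo/logs/access.log")      # stream the file back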
40. HDFS Node Types
HDFS has three types of nodes:
• NameNode (master node)
• Distributes files across the cluster
• Responsible for replication between the DataNodes and for file block locations
• DataNodes
• Responsible for the actual file storage
• Serve file data to clients
• BackupNode (version 0.23 and up)
• A backup of the NameNode
41. Typical implementation
• Nodes are commodity PCs
• 30-40 nodes per rack
• Uplink from racks is 3-4 gigabit
• Rack-internal is 1 gigabit
42. MapReduce is...
• A programming model for expressing distributed
computations at a massive scale
• An execution framework for organizing and performing such
computations
• An open-source implementation called Hadoop
43. MapReduce
Example: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar
-input myInputDirs
-output myOutputDir
-mapper /bin/cat
-reducer /bin/wc
• Runs programs (jobs) across many computers
• Protects against single-server failure by re-running failed steps
• MR jobs can be written in Java, C, Python, Ruby, etc.
• Users only write Map and Reduce functions
• MAP – takes a large problem and divides it into sub-problems; performs the same function on all sub-problems
• REDUCE – combines the output from all sub-problems
44. Typical large-data problem
• Iterate over a large number of records
• Extract something of interest from each (Map)
• Shuffle and sort intermediate results
• Aggregate intermediate results (Reduce)
• Generate final output
(Dean and Ghemawat, OSDI 2004)
45. MapReduce paradigm
• Implement two functions:
• Map(k1, v1) -> list(k2, v2)
• Reduce(k2, list(v2)) -> list(v3)
• Framework handles everything else*
• Values with the same key go to the same reducer
47. MapReduce - word count example
function map(String name, String document):
  for each word w in document:
    emit(w, 1)

function reduce(String word, Iterator partialCounts):
  totalCount = 0
  for each count in partialCounts:
    totalCount += count
  emit(word, totalCount)
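• Example (a hedged sketch of how the same word count could run on Hadoop Streaming with plain Python scripts; the file names mapper.py and reducer.py are illustrative, and the streaming jar is the one shown on the earlier MapReduce slide):
# --- mapper.py: emit (word, 1) for every word read from stdin ---
import sys
for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

# --- reducer.py: sum counts per word; streaming sorts by key before the reducer runs ---
import sys
current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word and current_word is not None:
        print("%s\t%d" % (current_word, total))
        total = 0
    current_word = word
    total += int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, total))
The pipeline can be tested locally without a cluster: cat document.txt | python mapper.py | sort | python reducer.py. On a cluster it would be passed to the streaming jar with something like -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py.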
49. MapReduce is good for...
• Embarrassingly parallel algorithms
• Summing, grouping, filtering, joining
• Off-line batch jobs on massive data sets
• Analyzing an entire large dataset
50. MapReduce is ok for...
• Iterative jobs (e.g., graph algorithms)
• Each iteration must read/write data to disk
• IO and latency cost of an iteration is high
51. MapReduce is NOT good for...
• Jobs that need shared state/coordination
• Tasks are shared-nothing
• Shared-state requires scalable state store
• Low-latency jobs
• Jobs on small datasets
• Finding individual records
53. Improving Hadoop
Core Hadoop is complicated, so tools were created to make things easier.
Improving programmability:
• Pig: Programming language that simplifies Hadoop actions: loading,
transforming and sorting data
• Hive: enables Hadoop to operate as data warehouse using SQL-like
syntax.
54. Pig
• Data flow processing
• Uses Pig Latin query language
• Highly parallel in order to distribute data processing across many servers
• Combining multiple data sources (files, HBase, Hive)
55. Hive
• Built on the MapReduce framework, so it generates MR jobs behind the scenes
• Hive is a data warehouse that enables easy data summarization and ad-hoc queries via
an SQL-like interface for large datasets stored in HDFS/HBase.
• Has partitioning and partition swapping
• Good for random sampling
• Example: CREATE EXTERNAL TABLE vs_hdfs (
site_id string,
session_id string,
time_stamp bigint,
visitor_id bigint,
row_unit string,
evts string,
biz string,
plne string,
dims string)
partitioned by (site string,day string)
ROW FORMAT DELIMITED FIELDS TERMINATED
BY '001'
STORED AS SEQUENCEFILE LOCATION
'/home/data/';
select session_id,
get_json_object(concat(tttt, "}"), '$.BY'),
get_json_object(concat(tttt, "}"), '$.TEXT') from
(
select session_id,concat("{",
regexp_replace(event, "[{|}]", ""), "}") tttt
from (
select session_id,get_json_object(plne,
'$.PLine.evts[*]') pln
from vs_hdfs_v1 where site='6964264'
and day='20120201' and plne!='{}' limit 10 ) t
LATERAL VIEW explode(split(pln, "},{"))
adTable AS event )t2
57. Improving Hadoop (cont.)
For improving access:
• HBase: column-oriented database that runs on top of HDFS.
• Sqoop: a tool designed to import data from relational
databases into Hadoop (HDFS or Hive).
58. HBase
What is HBase and why should you use it?
• Huge volumes of randomly accessed data.
• There are no restrictions on the number of columns per row – it's dynamic.
• Consider HBase when you're loading data by key, searching data by key (or range),
serving data by key, querying data by key, or when storing data by row that doesn't
conform well to a schema.
HBase don'ts:
• It doesn't talk SQL, have an optimizer, or support transactions or joins. If you don't use
any of these in your database application, then HBase could very well be the perfect fit.
Example:
create 'blogposts', 'post', 'image' ---create table
put 'blogposts', 'id1', 'post:title', 'Hello World' ---insert value
put 'blogposts', 'id1', 'post:body', 'This is a blog post' ---insert value
put 'blogposts', 'id1', 'image:header', 'image1.jpg' ---insert value
get 'blogposts', 'id1' ---select records
59. Sqoop
What is Sqoop?
• It’s a command line tool for moving data between HDFS and relational database systems.
• You can download drivers for Sqoop from Microsoft and then:
• Import Data/Query results from SQL Server to Hadoop.
• Export Data from Hadoop to SQL Server.
• It’s like BCP
• Example:
$bin/sqoop import --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch'
--table lineitem --hive-import
$bin/sqoop export --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' --table lineitem --export-dir
/data/lineitemData
60. Improving Hadoop (cont.)
• For improving coordination: Zookeeper
• For improving scheduling/orchestration: Oozie
• For improving UI: Hue
• Machine learning: Mahout
66. Big Data Market Survey
• 3 major groups for rolling your own Big Data:
• Integrated Hadoop providers.
• Analytical database with Hadoop connectivity.
• Hadoop-centered companies.
• Big Data on the Cloud.
71. Hadoop Centered Companies
• Cloudera – longest-established Hadoop distribution.
• Hortonworks – major contributor to the Hadoop code and core
components.
• MapR.
72. Big Data and Cloud
• Some Big Data solutions can be provided using IaaS:
Infrastructure as a service.
• Private clouds can be constructed using Hadoop orchestration
tools.
• Public clouds provided by Rackspace or Amazon EC2 can be
used to start a Hadoop cluster.
73. Big Data and Cloud (cont.)
• PaaS: Platform as a Service can be used to remove the need to
configure or scale things.
• The major PaaS Providers are Amazon, Google and Microsoft.
74. PaaS Services: Amazon
• Amazon:
• Elastic Map Reduce (EMR): MapReduce programs submitted to a
cluster managed by Amazon. Good for EC2/S3 combinations.
• DynamoDB: a NoSQL database service provided by Amazon as an alternative to HBase.
75. PaaS Services: Google
• Google:
• BigQuery: analytical database suitable for interactive analysis over
datasets of the order of 1TB.
• Prediction API: machine learning platform for classification and
sentiment analysis, done with Google's tools on customer data.
76. PaaS Services: Microsoft
• Microsoft:
• Windows Azure: a cloud computing platform and infrastructure that
can be used as PaaS and as IaaS.
78. Big Data Readiness
• The R&D Prototype Stage
• Skills needed:
• Distributed data deployment (e.g. Hadoop)
• Python or Java programming with MapReduce
• Statistical analysis (e.g. R)
• Data integration
• Ability to formulate business hypotheses
• Ability to convey business value of Big Data
79. Data Science
• A discipline that combines math, statistics, programming and
scientific instinct with the goal of extracting meaning from
data.
• Data scientists combine technical expertise, curiosity,
storytelling and cleverness to find and deliver the signal in the
noise.
80. The Rise of the Data Scientist
• Data scientists are responsible for
• modeling complex business problems
• discovering business insights
• identifying opportunities.
• Demand is high for people who can help make sense of the
massive streams of digital information pouring into
organizations
81. New Roles and Skills
• Big Data Scientists: industry expertise, analytics skills
• Big Data Engineers: Hadoop/Java, non-relational databases
• Agility and focus on value
83. Predictive Analytics
• Predictive analytics looks into the future to provide insight into
what will happen and includes what-if scenarios and risk
assessment. It can be used for
• Forecasting
• hypothesis testing
• risk modeling
• propensity modeling
84. Prescriptive analytics
• Prescriptive analytics is focused on understanding what would
happen based on different alternatives and scenarios, and then
choosing best options, and optimizing what’s ahead. Use
cases include
• Customer cross-channel optimization
• best-action-related offers
• portfolio and business optimization
• risk management
85. How Predictive Analytics Works
• Traditional BI tools use a deductive approach to data, which
assumes some understanding of existing patterns and
relationships.
• An analytics model approaches the data based on this
knowledge.
• For obvious reasons, deductive methods work well with
structured data.
86. Inductive approach
• An inductive approach makes no presumptions of patterns or
relationships and is more about data discovery. Predictive
analytics applies inductive reasoning to big data using
sophisticated quantitative methods such as
• machine learning
• neural networks
• robotics
• computational mathematics
• artificial intelligence
• The goal is to explore all the data and discover interrelationships and
patterns
87. Inductive approach – Cont.
• Inductive methods use algorithms to perform complex
calculations specifically designed to run against highly varied
or large volumes of data
• The result of applying these techniques to a real-world
business problem is a predictive model
• The ability to know what algorithms and data to use to test and
create the predictive model is part of the science and art of
predictive analytics
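• Example (a toy sketch of what “training a predictive model” looks like in code; the scikit-learn library, the tiny in-memory data set and the churn scenario are assumptions for illustration only, not from the original slides):
# Learn from labelled history, then score unseen cases
from sklearn.linear_model import LogisticRegression

# Toy training data: [monthly_usage, support_calls] -> churned (1) or not (0)
X = [[10, 0], [12, 1], [2, 5], [3, 4], [11, 0], [1, 6]]
y = [0, 0, 1, 1, 0, 1]

model = LogisticRegression()
model.fit(X, y)                       # inductive step: learn the pattern from data
print(model.predict([[2, 3]]))        # class prediction for a new customer
print(model.predict_proba([[2, 3]]))  # risk / propensity score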
88. Share Nothing vs. Share Everything
• Share nothing: many processing engines; data is spread across many nodes; joins are problematic; very scalable
• Share everything: many servers; data is located on a single storage; efficient joins; limited scalability
90. The Challenge
• We want scalable, durable, high volume, high velocity,
distributed data storage that can handle non-structured data
and that will fit our specific need
• RDBMS is too generic and doesn't cut it any more – it can do
the job but it is not cost-effective for our usage
91. The Solution: NoSQL
• Let's take some parts of the standard RDBMS out and
design the solution for our specific uses
• NoSQL databases have been around for ages under different
names/solutions
92. The NOSQL Movement
• NOSQL is not a technology – it’s a concept.
• We need high performance, scale out abilities or an agile
structure.
• We are now willing to sacrifice our sacred cows: consistency,
transactions.
• Over 150 different brands and solutions
(http://nosql-database.org/).
93. NoSQL or NOSQL
• NoSQL is not No to SQL
• NoSQL is not Never SQL
• NOSQL = Not Only SQL
94. Why NoSQL?
• Some applications need very few database features, but need
high scale.
• Desire to avoid data/schema pre-design altogether for simple
applications.
• Need for a low-latency, low-overhead API to access data.
• Simplicity -- do not need fancy indexing – just fast lookup by
primary key.
95. Why NoSQL? (cont.)
• Developer friendly, DBAs not needed (?).
• Schema-less.
• Agile: non-structured (or semi-structured).
• In Memory.
• No (or loose) Transactions.
• No joins.
97. Is NoSQL an RDBMS Replacement?
NO
Well… sometimes it is…
98. RDBMS vs. NoSQL
Rationale for choosing a persistent store:
• Data: high value, high density, complex (relational) vs. low value, low density, simple (NoSQL)
• Relationships: complex data relationships vs. very simple relationships
• Schema: schema-centric vs. schema-free, unstructured or semi-structured data
• Scaling: designed to scale up & out vs. distributed storage and processing
• Features: lots of general-purpose features/functionality vs. stripped-down, special-purpose data store
• Overhead: high ($ per operation) vs. low ($ per operation)
100. Scalability
• NoSQL is sometimes very easy to scale out
• Most have dynamic data partitioning and easy data distribution
• But distributed systems always come with a price: the CAP
Theorem and its impact on ACID transactions
101. ACID Transactions
Most DBMS are built with ACID transactions in mind:
• Atomicity: All or nothing, performs write operations as a single
transaction
• Consistency: Any transaction will take the DB from one
consistent state to another with no broken constraints,
ensures replicas are identical on different nodes
• Isolation: Other operations cannot access data that has been
modified during a transaction that has not been completed yet
• Durability: Ability to recover the committed transaction
updates against any kind of system failure (transaction log)
102. ACID Transactions (cont.)
• ACID is usually implemented by a locking mechanism/manager
• In distributed systems, central locking can become a bottleneck
• Most NoSQL databases do not use (or limit) ACID transactions and
replace them with something else…
103. CAP Theorem
• The CAP theorem states that in a distributed/partitioned
application, you can only pick two of the following
three characteristics:
• Consistency.
• Availability.
• Partition Tolerance.
105. NoSQL BASE
• NoSQL usually provide BASE characteristics instead of ACID.
BASE stands for:
• Basically Available
• Soft State
• Eventual Consistency
• It means that when an update is made in one place, the other
partitions will see it over time - there might be an inconsistency
window
• Read and write operations complete more quickly, lowering latency
110. Key Value Store
• Distributed hash tables.
• Very fast to get a single value.
• Examples:
• Amazon DynamoDB
• Berkeley DB
• Redis
• Riak
• Cassandra
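• Example (a minimal sketch of key/value access with Redis from Python; it assumes a locally running Redis server and the redis-py client package, and the key names are illustrative):
import redis

r = redis.Redis(host="localhost", port=6379)

r.set("session:1234", "zohar")   # store a value under a key
print(r.get("session:1234"))     # very fast single-key lookup -> b'zohar'
r.expire("session:1234", 3600)   # optional TTL, common for cache-style usage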
111. Document Store
• Similar to Key/Value, but value is a document.
• JSON or something similar, flexible schema
• Agile technology.
• Examples:
• MongoDB
• CouchDB
• CouchBase
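• Example (a minimal sketch of the same idea with MongoDB through PyMongo; it assumes a locally running mongod and the pymongo package, and the database, collection and field names are illustrative):
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
posts = client.blog.posts   # database "blog", collection "posts"

# Flexible schema: each document is just JSON-like data
posts.insert_one({"title": "Hello World",
                  "body": "This is a blog post",
                  "tags": ["intro", "example"]})

print(posts.find_one({"title": "Hello World"}))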
112. Column Store
• One key, multiple attributes.
• Hybrid row/column.
• Examples:
• Google BigTable
• HBase
• Amazon’s SimpleDB
• Cassandra
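• Example (a hedged sketch of column-family access against HBase using the happybase Python client; it assumes a running HBase Thrift server and the happybase package, and mirrors the blogposts table from the HBase slide earlier):
import happybase

connection = happybase.Connection("localhost")   # HBase Thrift server
table = connection.table("blogposts")

# One row key; values are grouped into column families (post:, image:)
table.put(b"id1", {b"post:title": b"Hello World",
                   b"post:body": b"This is a blog post",
                   b"image:header": b"image1.jpg"})

print(table.row(b"id1"))   # fetch every column stored for that key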
113. How Are Records Organized?
• This is a logical table in RDBMS systems
• Its physical organization is just like the logical one: column by
column, row by row
(Figure: a table with rows 1–4 and columns 1–4)
114. Query Data
• When we query data, records are read in the
order they are organized in the physical structure
• Even when we query a single
column, we still need to read the
entire table and extract the column
(Figure: the same table with rows 1–4 and columns 1–4)
Example queries:
Select Col2 From MyTable
Select * From MyTable
115. How Does a Column Store Save Data?
(Figure: organization in a row store vs. organization in a column store)
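• Example (a purely illustrative Python sketch of the difference between the two layouts; the data is made up, and the point is only which values sit next to each other):
# Row store: all values of one record sit together
row_store = [
    {"id": 1, "name": "Dana", "city": "Haifa"},
    {"id": 2, "name": "Omer", "city": "Eilat"},
]

# Column store: all values of one column sit together
column_store = {
    "id":   [1, 2],
    "name": ["Dana", "Omer"],
    "city": ["Haifa", "Eilat"],
}

# Reading a single column touches only that column's data...
print(column_store["city"])
# ...while the row store must scan every record for the same answer
print([record["city"] for record in row_store])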
116. Graph Store
• Inspired by Graph Theory.
• Data model: Nodes, relationships, properties on both.
• Relational databases have a very hard time representing a graph
in the database.
• Examples:
• Neo4j
• InfiniteGraph
• RDF
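• Example (a hedged sketch of creating and querying a small graph in Neo4j from Python; it assumes a local Neo4j instance, the official neo4j driver package, and illustrative credentials and data):
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Nodes with properties, connected by a typed, dated relationship
    session.run("CREATE (a:Person {name: 'Alice'})-[:KNOWS {since: 2015}]->"
                "(b:Person {name: 'Bob'})")
    result = session.run("MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN a.name, b.name")
    for record in result:
        print(record["a.name"], "knows", record["b.name"])

driver.close()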
117. What is a Graph?
• An abstract representation of a set of objects where some
pairs are connected by links.
• Object (Vertex, Node) – can have attributes like name and
value
• Link (Edge, Arc, Relationship) – can have attributes like type
and name or date
122. Conclusion
• Big Data is one of the hottest buzzwords of the last few years – we
should all know what it's all about
• DBAs are often called upon to help with Big Data problems – today DBAs
need to know what to ask in order to provide good solutions, even if
it's not a database-related issue
• NoSQL doesn't have to be a Big Data solution, but Big Data
often uses NoSQL solutions