Zohar Elkayam
CTO, Brillix
Zohar@Brillix.co.il
Twitter: @realmgic
Introduction to Big Data
Agenda
• What is Big Data and the 3 Vs
• Introduction to Hadoop
• Who Handles Big Data and Data Science
• NoSQL
Who am I?
• Zohar Elkayam, CTO at Brillix
• Oracle ACE Associate
• DBA, team leader, instructor and senior consultant for over 16 years
• Editor (and manager) of ilDBA – Israel Database Community
• Blogger – www.realdbamagic.com
What is Big Data?
So, What is Big Data?
• When the data is too big or moves too fast to handle in a
sensible amount of time.
• When the data doesn’t fit conventional database structure.
• When the solution becomes part of the problem.
Big Problems with Big Data
• Unstructured
• Unprocessed
• Un-aggregated
• Un-filtered
• Repetitive
• Low quality
• And generally messy
• Oh, and there is a lot of it
Sample of Big Data Use Cases Today
• Media / Entertainment – viewers / advertising effectiveness
• Communications – location-based advertising
• Education & Research – experiment sensor analysis
• Consumer Packaged Goods – sentiment analysis of what’s hot, problems
• Health Care – patient sensors, monitoring, EHRs, quality of care
• Life Sciences – clinical trials, genomics
• High Technology / Industrial Mfg. – manufacturing quality, warranty analysis
• Oil & Gas – drilling exploration sensor analysis
• Financial Services – risk & portfolio analysis, new products
• Automotive – auto sensors reporting location, problems
• Retail – consumer sentiment, optimized marketing
• Law Enforcement & Defense – threat analysis: social media monitoring, photo analysis
• Travel & Transportation – sensor analysis for optimal traffic flows, customer sentiment
• Utilities – smart meter analysis for network capacity
• On-line Services / Social Media – people & career matching, web-site optimization
Most Requested Uses of Big Data
• Log Analytics & Storage
• Smart Grid / Smarter Utilities
• RFID Tracking & Analytics
• Fraud / Risk Management & Modeling
• 360° View of the Customer
• Warehouse Extension
• Email / Call Center Transcript Analysis
• Call Detail Record Analysis
The Challenge
The Big Data Challenge (3V)
Big Data: Challenge to Value
• Challenges today: high variety, high volume, high velocity
• Business value tomorrow: deep analytics, high agility, massive
scalability, real time
Challenges
Volume
• Big data comes in one size: Big.
Size is measured in terabytes, petabytes and even exabytes
and zettabytes.
• Storing and handling the data becomes an issue.
• Producing value out of the data in a reasonable time is also
an issue.
Velocity
• The speed at which the data is being generated and collected.
• Streaming data and large-volume data movement.
• High velocity of data capture – requires rapid ingestion.
• What happens on downtime (the backlog problem).
Variety
• Big Data extends beyond structured data: it includes semi-structured
and unstructured information such as logs, text, audio and video.
• Wide variety of rapidly evolving data types requires highly
flexible stores and handling.
Big Data is ANY data
Unstructured, Semi-Structured and Structured
• Some data has a fixed structure
• Some data is “bring your own structure”
• We want to find value in all of it
Structured & Un-Structured
Un-Structured        | Structured
Objects              | Tables
Flexible             | Columns and Rows
Structure Unknown    | Predefined Structure
Textual and Binary   | Mostly Textual
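To make the contrast concrete, here is a small sketch of the same customer record in both shapes (the field names are illustrative):

# The same customer record in both shapes.
row = (1001, "alice", "2014-03-02")    # structured: meaning lives in the table definition

doc = {                                # semi-structured: meaning travels with the data
    "id": 1001,
    "name": "alice",
    "visits": [{"date": "2014-03-02", "pages": 7}],
    "tags": ["vip"],                   # another document may not have this field at all
}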
Handling Big Data
Big Data in Practice
• Big data is big: technological infrastructure solutions needed.
• Big data is messy: data sources must be cleaned before use.
• Big data is complicated: need developers and system admins
to manage intake of data.
Big Data in Practice (cont.)
• Data must be broken out of silos in order to be mined,
analyzed and transformed into value.
• The organization must learn how to communicate and
interpret the results of analysis.
Infrastructure Challenges
• Infrastructure that is built for:
• Large-scale
• Distributed
• Data-intensive jobs that spread the problem across
clusters of server nodes
Infrastructure Challenges – Cont.
• Storage:
• Efficient and cost-effective enough to capture and store terabytes, if
not petabytes, of data
• With intelligent capabilities to reduce your data footprint such as:
• Data compression
• Automatic data tiering
• Data deduplication
Infrastructure Challenges – Cont.
• Network infrastructure that can quickly import large data sets
and then replicate them to various nodes for processing
• Security capabilities that protect highly-distributed
infrastructure and data
Intro to Hadoop
Apache Hadoop
• Open source project run by Apache (2006).
• Hadoop brings the ability to cheaply process large amounts
of data, regardless of its structure.
• Apache Hadoop has been the driving force behind the growth
of the big data industry.
Hadoop Creation History
Key points
• An open-source framework that uses a simple programming model
to enable distributed processing of large data sets on clusters of
computers.
• The complete technology stack includes
• common utilities
• a distributed file system
• analytics and data storage platforms
• an application layer that manages distributed processing, parallel
computation, workflow, and configuration management
• More cost-effective for handling large unstructured data sets than
conventional approaches, and offers massive scalability and
speed
Why use Hadoop?
• Cost – leverages commodity hardware & open source software
• Scalability – near-linear performance up to 1000s of nodes
• Flexibility – versatility with data, analytics & operation
Really, Why use Hadoop?
• Need to process multi-petabyte datasets
• Expensive to build reliability in each application.
• Nodes fail every day
• Failure is expected, rather than exceptional.
• The number of nodes in a cluster is not constant.
• Need common infrastructure
• Efficient, reliable, Open Source Apache License
• The above goals are the same as Condor’s, but
• Workloads are IO bound and not CPU bound
Hadoop Benefits
• Reliable solution based on unreliable hardware
• Designed for large files
• Load data first, structure later
• Designed to maximize throughput of large scans
• Designed to leverage parallelism
• Designed to scale
• Flexible development platform
• Solution Ecosystem
Hadoop Limitations
• Hadoop is scalable but not fast
• Some assembly required
• Batteries not included
• Instrumentation not included either
• DIY mindset (remember Linux/MySQL?)
• On the larger scale – Hadoop is not cheap (but still cheaper
than using old solutions)
Example Comparison: RDBMS vs. Hadoop
                      Typical Traditional RDBMS   Hadoop
Data Size             Gigabytes                   Petabytes
Access                Interactive and Batch       Batch – NOT Interactive
Updates               Read / Write many times     Write once, Read many times
Structure             Static Schema               Dynamic Schema
Scaling               Nonlinear                   Linear
Query Response Time   Can be near immediate       Has latency (due to batch processing)
Hadoop And Relational Database
Relational Database – best used for:
• Interactive OLAP Analytics (<1sec)
• Multistep Transactions
• 100% SQL Compliance
Hadoop – best used for:
• Structured or Not (Flexibility)
• Scalability of Storage/Compute
• Complex Data Processing
• Cheaper compared to RDBMS
Best when used together
Hadoop Components
Hadoop Main Components
• HDFS: Hadoop Distributed File System – distributed file
system that runs in a clustered environment.
• MapReduce – programming paradigm for running processes
over clustered environments.
HDFS is...
• A distributed file system
• Redundant storage
• Designed to reliably store data using commodity hardware
• Designed to expect hardware failures
• Intended for large files
• Designed for batch inserts
• The Hadoop Distributed File System
HDFS Node Types
HDFS has three types of Nodes
• Namenode (MasterNode)
• Distributes files in the cluster
• Responsible for the replication between
the datanodes and for file block locations
• Datanodes
• Responsible for the actual file storage
• Serves file data to clients
• BackupNode (version 0.23 and up)
• It’s a backup of the NameNode
Typical implementation
• Nodes are commodity PCs
• 30-40 nodes per rack
• Uplink from racks is 3-4 gigabit
• Rack-internal is 1 gigabit
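To get a feel for HDFS, here is a minimal sketch that drives the standard hadoop fs command line from Python; the directory and file names are illustrative, and it assumes a configured Hadoop client on the PATH:

import subprocess

def hdfs(*args):
    # Run one "hadoop fs" command and return its stdout;
    # raises CalledProcessError if the command fails.
    result = subprocess.run(["hadoop", "fs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

hdfs("-mkdir", "-p", "/user/demo/logs")        # create a directory in HDFS
hdfs("-put", "access.log", "/user/demo/logs")  # upload a local file
print(hdfs("-ls", "/user/demo/logs"))          # list the directory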
MapReduce is...
• A programming model for expressing distributed
computations at a massive scale
• An execution framework for organizing and performing such
computations
• An open-source implementation called Hadoop
MapReduce
Example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -input myInputDirs \
  -output myOutputDir \
  -mapper /bin/cat \
  -reducer /bin/wc
• Runs programs (jobs) across many computers
• Protects against single-server failure by re-running failed steps
• MR jobs can be written in Java, C, Python, Ruby, etc.
• Users only write Map and Reduce functions
• MAP – takes a large problem and divides it into sub-problems;
performs the same function on all sub-problems
• REDUCE – combines the output from all sub-problems
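The /bin/cat and /bin/wc placeholders in the example above can be swapped for real Map and Reduce programs. A minimal word-count sketch in Python following the streaming protocol (tab-separated key/value lines on stdin/stdout; the reducer receives its input sorted by key) – the file names are illustrative:

#!/usr/bin/env python3
# mapper.py – emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

#!/usr/bin/env python3
# reducer.py – sum counts per word; streaming delivers input sorted
# by key, so all lines for one word arrive together
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(current_word + "\t" + str(total))
        current_word, total = word, 0
    total += int(count)
if current_word is not None:
    print(current_word + "\t" + str(total))

These would then be passed to the same streaming jar with -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py.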
Typical large-data problem
• Iterate over a large number of records
• Extract something of interest from each
• Shuffle and sort intermediate results
• Aggregate intermediate results
• Generate final output
(Map covers iterating and extracting; Reduce covers aggregating and
generating the final output – Dean and Ghemawat, OSDI 2004)
MapReduce paradigm
• Implement two functions:
• Map(k1, v1) -> list(k2, v2)
• Reduce(k2, list(v2)) -> list(v3)
• Framework handles everything else*
• Values with the same key go to the same reducer
Divide and Conquer
MapReduce - word count example
function map(String name, String document):
    for each word w in document:
        emit(w, 1)

function reduce(String word, Iterator partialCounts):
    totalCount = 0
    for each count in partialCounts:
        totalCount += count
    emit(word, totalCount)
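The part the pseudocode leaves to the framework is the shuffle: grouping every emitted value by its key. A self-contained Python sketch of the whole flow (the toy driver and sample documents are illustrative; real Hadoop does this across many machines and disks):

from collections import defaultdict

def map_fn(name, document):
    # Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word
    return [(word, 1) for word in document.split()]

def reduce_fn(word, partial_counts):
    # Reduce(k2, list(v2)) -> list(v3): sum the partial counts
    return [(word, sum(partial_counts))]

def run_mapreduce(inputs, mapper, reducer):
    groups = defaultdict(list)      # the "shuffle": group values by key
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = []                    # the "sort": reduce keys in order
    for out_key in sorted(groups):
        results.extend(reducer(out_key, groups[out_key]))
    return results

docs = [("d1", "big data is big"), ("d2", "data moves fast")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# -> [('big', 2), ('data', 2), ('fast', 1), ('is', 1), ('moves', 1)]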
MapReduce Word Count Process
MapReduce is good for...
• Embarrassingly parallel algorithms
• Summing, grouping, filtering, joining
• Off-line batch jobs on massive data sets
• Analyzing an entire large dataset
MapReduce is ok for...
• Iterative jobs (e.g., graph algorithms)
• Each iteration must read/write data to disk
• IO and latency cost of an iteration is high
MapReduce is NOT good for...
• Jobs that need shared state/coordination
• Tasks are shared-nothing
• Shared-state requires scalable state store
• Low-latency jobs
• Jobs on small datasets
• Finding individual records
Improving Hadoop
Core Hadoop is complicated, so tools were created to make things
easier.
Improving programmability:
• Pig: Programming language that simplifies Hadoop actions: loading,
transforming and sorting data
• Hive: enables Hadoop to operate as a data warehouse using SQL-like
syntax.
Pig
• Data flow processing
• Uses Pig Latin query language
• Highly parallel in order to distribute data processing across many servers
• Combining multiple data sources (files, HBase, Hive)
• Example:
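A minimal sketch of what such a Pig Latin data flow looks like (the file names and fields here are illustrative, not the deck's original example): it loads tab-delimited log records, groups them, counts per group and sorts the result.

logs   = LOAD '/data/access_logs' USING PigStorage('\t')
         AS (visitor:chararray, action:chararray);
by_act = GROUP logs BY action;                               -- group rows per action
counts = FOREACH by_act GENERATE group AS action, COUNT(logs) AS cnt;
ranked = ORDER counts BY cnt DESC;                           -- sort by frequency
STORE ranked INTO '/data/action_counts';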
Hive
• Built on the MapReduce framework, so it generates MR jobs behind the scenes
• Hive is a data warehouse that enables easy data summarization and ad-hoc queries via
an SQL-like interface for large datasets stored in HDFS/HBase.
• Has partitioning and partition swapping
• Good for random sampling
• Example:
CREATE EXTERNAL TABLE vs_hdfs (
  site_id string,
  session_id string,
  time_stamp bigint,
  visitor_id bigint,
  row_unit string,
  evts string,
  biz string,
  plne string,
  dims string)
PARTITIONED BY (site string, day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '001'
STORED AS SEQUENCEFILE
LOCATION '/home/data/';

SELECT session_id,
       get_json_object(concat(tttt, "}"), '$.BY'),
       get_json_object(concat(tttt, "}"), '$.TEXT')
FROM (
  SELECT session_id,
         concat("{", regexp_replace(event, "[{|}]", ""), "}") tttt
  FROM (
    SELECT session_id, get_json_object(plne, '$.PLine.evts[*]') pln
    FROM vs_hdfs_v1
    WHERE site = '6964264' AND day = '20120201' AND plne != '{}'
    LIMIT 10) t
  LATERAL VIEW explode(split(pln, "},{")) adTable AS event) t2;
HADOOP Technology Stack
• HDFS – persistence (Yahoo)
• Map/Reduce – parallel processing (Google)
• Pig – scripting (Yahoo)
• Hive – SQL query (Facebook)
Improving Hadoop (cont.)
For improving access:
• HBase: column-oriented database that runs on HDFS.
• Sqoop: a tool designed to import data from relational
databases into HDFS or Hive.
HBase
What is HBase and why should you use it?
• Huge volumes of randomly accessed data.
• There are no restrictions on the number of columns per row – it’s dynamic.
• Consider HBase when you’re loading data by key, searching data by key (or range),
serving data by key, querying data by key or when storing data by row that doesn’t
conform well to a schema.
HBase don’ts:
• It doesn’t talk SQL, doesn’t have an optimizer, and doesn’t support transactions or joins.
If you don’t use any of these in your database application then HBase could very well be the perfect fit.
Example:
create 'blogposts', 'post', 'image'                         --- create table
put 'blogposts', 'id1', 'post:title', 'Hello World'         --- insert value
put 'blogposts', 'id1', 'post:body', 'This is a blog post'  --- insert value
put 'blogposts', 'id1', 'image:header', 'image1.jpg'        --- insert value
get 'blogposts', 'id1'                                      --- select record
Sqoop
What is Sqoop?
• It’s a command-line tool for moving data between HDFS and relational database systems.
• You can download drivers for Sqoop from Microsoft and:
• Import data / query results from SQL Server to Hadoop.
• Export data from Hadoop to SQL Server.
• It’s like BCP
• Example:
$ bin/sqoop import \
    --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' \
    --table lineitem --hive-import
$ bin/sqoop export \
    --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' \
    --table lineitem --export-dir /data/lineitemData
Improving Hadoop (cont.)
• For improving coordination: Zookeeper
• For improving scheduling/orchestration: Oozie
• For improving UI: Hue
• Machine learning: Mahout
HADOOP Technology Ecosystem
Hadoop Tools
Hadoop cluster
Cluster of machines running Hadoop at Yahoo! (credit: Yahoo!)
Hadoop In The Real World
Who uses Hadoop?
Big Data Market Survey
• 3 major groups for rolling your own Big Data:
• Integrated Hadoop providers.
• Analytical database with Hadoop connectivity.
• Hadoop-centered companies.
• Big Data on the Cloud.
Integrated Hadoop Providers
• IBM InfoSphere
Database: DB2
Deployment options: software (Enterprise Linux), cloud
Hadoop: bundled distribution (InfoSphere BigInsights); Hive,
Oozie, Pig, Zookeeper, Avro, Flume, HBase, Lucene
NoSQL: HBase
Integrated Hadoop Providers
• Microsoft
Database: SQL Server
Deployment options: software (Windows Server), cloud (Windows Azure
Cloud)
Hadoop: bundled distribution (Big Data Solution); Hive, Pig
NoSQL: none
Integrated Hadoop Providers
• Oracle
Database: none
Deployment options: appliance (Oracle Big Data Appliance)
Hadoop: bundled distribution (Cloudera’s Distribution including
Apache Hadoop); Hive, Oozie, Pig, Zookeeper, Avro,
Flume, HBase, Sqoop, Mahout, Whirr
NoSQL: Oracle NoSQL Database
Integrated Hadoop Providers
• Pivotal Greenplum
Database: Greenplum Database
Deployment options: appliance (Modular Data Computing appliance),
software (Enterprise Linux), cloud (Cloud Foundry)
Hadoop: bundled distribution (Pivotal HD); Hive, Pig, Zookeeper,
HBase
NoSQL: HBase
Hadoop Centered Companies
• Cloudera – the longest-established Hadoop distribution.
• Hortonworks – major contributor to the Hadoop code and core
components.
• MapR.
Big Data and Cloud
• Some Big Data solutions can be provided using IaaS:
Infrastructure as a Service.
• Private clouds can be constructed using Hadoop orchestration
tools.
• Public clouds provided by Rackspace or Amazon EC2 can be
used to start a Hadoop cluster.
Big Data and Cloud (cont.)
• PaaS: Platform as a Service can be used to remove the need to
configure or scale things.
• The major PaaS Providers are Amazon, Google and Microsoft.
PaaS Services: Amazon
• Amazon:
• Elastic Map Reduce (EMR): MapReduce programs submitted to a
cluster managed by Amazon. Good for EC2/S3 combinations.
• DynamoDB: NoSQL database provided by Amazon to replace HBase.
PaaS Services: Google
• Google:
• BigQuery: analytical database suitable for interactive analysis over
datasets on the order of 1TB.
• Prediction API: machine learning platform for classification and
sentiment analysis, run with Google’s tools on the customer’s data.
PaaS Services: Microsoft
• Microsoft:
• Windows Azure: a cloud computing platform and infrastructure that
can be used as PaaS and as IaaS.
Who Handles Big Data
… and how?
Big Data Readiness
• The R&D Prototype Stage
• Skills needed:
• Distributed data deployment (e.g. Hadoop)
• Python or Java programming with MapReduce
• Statistical analysis (e.g. R)
• Data integration
• Ability to formulate business hypotheses
• Ability to convey business value of Big Data
Data Science
• A discipline that combines math, statistics, programming and
scientific instinct with the goal of extracting meaning from
data.
• Data scientists combine technical expertise, curiosity,
storytelling and cleverness to find and deliver the signal in the
noise.
The Rise of the Data Scientist
• Data scientists are responsible for
• modeling complex business problems
• discovering business insights
• identifying opportunities.
• Demand is high for people who can help make sense of the
massive streams of digital information pouring into
organizations
New Roles and Skills
• Big Data Scientist – industry expertise, analytics skills
• Big Data Engineers – Hadoop/Java, non-relational DBs
• Agility and focus on value
Predictive Analytics
• Predictive analytics looks into the future to provide insight into
what will happen and includes what-if scenarios and risk
assessment. It can be used for
• Forecasting
• hypothesis testing
• risk modeling
• propensity modeling
Prescriptive analytics
• Prescriptive analytics is focused on understanding what would
happen based on different alternatives and scenarios, and then
choosing best options, and optimizing what’s ahead. Use
cases include
• Customer cross-channel optimization
• best-action-related offers
• portfolio and business optimization
• risk management
How Predictive Analytics Works
• Traditional BI tools use a deductive approach to data, which
assumes some understanding of existing patterns and
relationships.
• An analytics model approaches the data based on this
knowledge.
• For obvious reasons, deductive methods work well with
structured data.
Inductive approach
• An inductive approach makes no presumptions of patterns or
relationships and is more about data discovery. Predictive
analytics applies inductive reasoning to big data using
sophisticated quantitative methods such as
• machine learning
• neural networks
• robotics
• computational mathematics
• artificial intelligence
• The goal is to explore all the data and discover interrelationships
and patterns
Inductive approach – Cont.
• Inductive methods use algorithms to perform complex
calculations specifically designed to run against highly varied
or large volumes of data
• The result of applying these techniques to a real-world
business problem is a predictive model
• The ability to know what algorithms and data to use to test and
create the predictive model is part of the science and art of
predictive analytics
Share Nothing vs. Share Everything

Share Nothing                  | Share Everything
Many processing engines        | Many servers
Data is spread on many nodes   | Data is located on a single storage
Joins are problematic          | Efficient joins
Very scalable                  | Limited scalability
Big Data and NoSQL
The Challenge
• We want scalable, durable, high volume, high velocity,
distributed data storage that can handle non-structured data
and that will fit our specific need
• RDBMS is too generic and doesn’t cut it any more – it can do
the job but it is not cost-effective for our usage
The Solution: NoSQL
• Let’s take some parts of the standard RDBMS out and design
the solution for our specific uses
• NoSQL databases have been around for ages under different
names/solutions
The NOSQL Movement
• NOSQL is not a technology – it’s a concept.
• We need high performance, scale-out abilities, or an agile
structure.
• We are now willing to sacrifice our sacred cows: consistency,
transactions.
• Over 150 different brands and solutions
(http://nosql-database.org/).
NoSQL or NOSQL
• NoSQL is not No to SQL
• NoSQL is not Never SQL
• NOSQL = Not Only SQL
Why NoSQL?
• Some applications need very few database features, but need
high scale.
• Desire to avoid data/schema pre-design altogether for simple
applications.
• Need for a low-latency, low-overhead API to access data.
• Simplicity – no need for fancy indexing, just fast lookup by
primary key.
Why NoSQL? (cont.)
• Developer friendly, DBAs not needed (?).
• Schema-less.
• Agile: non-structured (or semi-structured).
• In Memory.
• No (or loose) Transactions.
• No joins.
Is NoSQL an RDBMS Replacement?
NO
Well… sometimes it is…
RDBMS vs. NoSQL
Rationale for choosing a persistent store:
Relational Architecture                         | NoSQL Architecture
High value, high density, complex data          | Low value, low density, simple data
Complex data relationships                      | Very simple relationships
Schema-centric                                  | Schema-free, unstructured or semi-structured data
Designed to scale up & out                      | Distributed storage and processing
Lots of general-purpose features/functionality  | Stripped down, special-purpose data store
High overhead ($ per operation)                 | Low overhead ($ per operation)
Scalability and Consistency
Scalability
• NoSQL is sometimes very easy to scale out
• Most have dynamic data partitioning and easy data distribution
• But distributed systems always come with a price: the CAP
Theorem and its impact on ACID transactions
ACID Transactions
Most DBMS are built with ACID transactions in mind:
• Atomicity: All or nothing, performs write operations as a single
transaction
• Consistency: Any transaction will take the DB from one
consistent state to another with no broken constraints,
ensures replicas are identical on different nodes
• Isolation: Other operations cannot access data that has been
modified during a transaction that has not been completed yet
• Durability: Ability to recover the committed transaction
updates against any kind of system failure (transaction log)
ACID Transactions (cont.)
• ACID is usually implemented by a locking mechanism/manager
• In distributed systems, central locking can be a bottleneck
• Most NoSQL databases don’t use (or limit) ACID transactions,
replacing them with something else…
CAP Theorem
• The CAP theorem states that in a distributed/partitioned
application, you can only pick two of the following
three characteristics:
• Consistency.
• Availability.
• Partition Tolerance.
CAP in Practice
NoSQL BASE
• NoSQL databases usually provide BASE characteristics instead of ACID.
BASE stands for:
• Basically Available
• Soft State
• Eventual Consistency
• It means that when an update is made in one place, the other
partitions will see it over time - there might be an inconsistency
window
• Read and write operations complete more quickly, lowering
latency
Eventual Consistency
Types of NoSQL
NoSQL Taxonomy
• Key-Value Store
• Document Store
• Column Store
• Graph Store
NoSQL Map
(Figure: the stores plotted against the typical RDBMS “SQL comfort zone” –
data complexity on one axis, size and performance on the other, running
from key-value stores through column stores and document databases up
to graph databases)
Key Value Store
• Distributed hash tables.
• Very fast to get a single value.
• Examples:
• Amazon DynamoDB
• Berkeley DB
• Redis
• Riak
• Cassandra
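A minimal sketch of the key-value model using Redis from Python (assumes a local Redis server and the redis-py client; the key names are illustrative):

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
r.set("user:1001:name", "alice")     # write a single value by key
r.expire("user:1001:name", 3600)     # optional TTL, common in KV stores
print(r.get("user:1001:name"))       # fast lookup by primary key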
Document Store
• Similar to Key/Value, but value is a document.
• JSON or something similar, flexible schema
• Agile technology.
• Examples:
• MongoDB
• CouchDB
• CouchBase
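A minimal sketch of the document model using MongoDB via pymongo (assumes a local mongod; the database, collection and fields are illustrative). Note that the two documents don't share a schema:

from pymongo import MongoClient

posts = MongoClient("mongodb://localhost:27017")["blog"]["posts"]
# Flexible schema: documents in one collection need not share fields.
posts.insert_one({"title": "Hello", "tags": ["intro"], "views": 10})
posts.insert_one({"title": "Big Data", "author": {"name": "Zohar"}})
print(posts.find_one({"title": "Big Data"}))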
Column Store
• One key, multiple attributes.
• Hybrid row/column.
• Examples:
• Google BigTable
• HBase
• Amazon’s SimpleDB
• Cassandra
How Are Records Organized?
• This is a logical table in RDBMS systems
• Its physical organization is just like the logical one: column by
column, row by row
Query Data
• When we query data, records are read in the order they are
organized in the physical structure
• Even when we query a single column (SELECT Col2 FROM MyTable),
we still need to read the entire table and extract the column,
just as with SELECT * FROM MyTable
How Does a Column Store Save Data?
(Figure: organization in a row store vs. organization in a column store)
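A toy Python sketch of the difference (the layouts are illustrative, not any particular engine's format): the row layout must touch every record to answer a single-column query, while the column layout reads just the one list it needs.

rows = [                      # row store: each record is kept together
    (1, "alice", 30),
    (2, "bob", 25),
    (3, "carol", 41),
]
columns = {                   # column store: each column is kept together
    "id":   [1, 2, 3],
    "name": ["alice", "bob", "carol"],
    "age":  [30, 25, 41],
}

# SELECT name FROM t: the row layout touches every full record...
names_from_rows = [record[1] for record in rows]
# ...while the column layout reads only the single column it needs.
names_from_columns = columns["name"]
assert names_from_rows == names_from_columns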
Graph Store
• Inspired by graph theory.
• Data model: nodes, relationships, properties on both.
• Relational databases have a very hard time representing a graph
in the database.
• Examples:
• Neo4j
• InfiniteGraph
• RDF
What is a Graph?
• An abstract representation of a set of objects where some
pairs are connected by links.
• Object (Vertex, Node) – can have attributes like name and
value
• Link (Edge, Arc, Relationship) – can have attributes like type
and name or date
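A small sketch of the property-graph model using the networkx Python library (the node names and attributes are illustrative):

import networkx as nx

g = nx.DiGraph()
g.add_node("alice", type="person", age=40)   # node with properties
g.add_node("nosql", type="group")
g.add_edge("alice", "nosql", type="member", since=2012)  # edge with properties

# Traverse: which groups is alice a member of, and since when?
for _, group, attrs in g.out_edges("alice", data=True):
    if attrs["type"] == "member":
        print(group, attrs["since"])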
Graph Types
• Undirected graph
• Directed graph
• Pseudo graph
• Multi graph
More Graph Types
• Weighted graph – edges carry a numeric weight (e.g. 10)
• Labeled graph – edges carry a label (e.g. “Like”)
• Property graph – nodes and edges both carry properties (e.g. an edge
“friend, date 2013” connecting nodes “Name: yosi, Age: 40” and
“Name: ami, Age: 30”)
Relationships
(Figure: a property graph – person nodes {ID:1, TYPE:F, NAME:alice},
{ID:2, TYPE:M, NAME:bob} and {ID:1, TYPE:F, NAME:dafna}, and a group node
{ID:1, TYPE:G, NAME:NoSQL}, connected by edges of TYPE:member, Since:2012)
Conclusion
• Big Data is one of the hottest buzzwords of the last few years – we
should all know what it’s all about
• DBAs are often called upon for big data problems – today DBAs
need to know what to ask in order to provide good solutions, even if
it’s not a database-related issue
• NoSQL doesn’t have to be a Big Data solution, but Big Data
often uses NoSQL solutions
Thank You
Zohar Elkayam
Brillix
Zohar@Brillix.co.il
www.realdbamagic.com

Weitere ähnliche Inhalte

Was ist angesagt?

Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem (c17lv version)
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem (c17lv version)Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem (c17lv version)
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem (c17lv version)Zohar Elkayam
 
Oracle Database In-Memory Option for ILOUG
Oracle Database In-Memory Option for ILOUGOracle Database In-Memory Option for ILOUG
Oracle Database In-Memory Option for ILOUGZohar Elkayam
 
MySQL 5.7 New Features for Developers
MySQL 5.7 New Features for DevelopersMySQL 5.7 New Features for Developers
MySQL 5.7 New Features for DevelopersZohar Elkayam
 
Introduction to Oracle Data Guard Broker
Introduction to Oracle Data Guard BrokerIntroduction to Oracle Data Guard Broker
Introduction to Oracle Data Guard BrokerZohar Elkayam
 
Connector/J Beyond JDBC: the X DevAPI for Java and MySQL as a Document Store
Connector/J Beyond JDBC: the X DevAPI for Java and MySQL as a Document StoreConnector/J Beyond JDBC: the X DevAPI for Java and MySQL as a Document Store
Connector/J Beyond JDBC: the X DevAPI for Java and MySQL as a Document StoreFilipe Silva
 
Introduction of MariaDB AX / TX
Introduction of MariaDB AX / TXIntroduction of MariaDB AX / TX
Introduction of MariaDB AX / TXGOTO Satoru
 
Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2Connor McDonald
 
MariaDB: Connect Storage Engine
MariaDB: Connect Storage EngineMariaDB: Connect Storage Engine
MariaDB: Connect Storage EngineKangaroot
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with HadoopCloudera, Inc.
 
Trusted advisory on technology comparison --exadata, hana, db2
Trusted advisory on technology comparison --exadata, hana, db2Trusted advisory on technology comparison --exadata, hana, db2
Trusted advisory on technology comparison --exadata, hana, db2Ajay Kumar Uppal
 
Big Data's Journey to ACID
Big Data's Journey to ACIDBig Data's Journey to ACID
Big Data's Journey to ACIDOwen O'Malley
 
Hive Data Modeling and Query Optimization
Hive Data Modeling and Query OptimizationHive Data Modeling and Query Optimization
Hive Data Modeling and Query OptimizationEyad Garelnabi
 
Enabling real interactive BI on Hadoop
Enabling real interactive BI on HadoopEnabling real interactive BI on Hadoop
Enabling real interactive BI on HadoopDataWorks Summit
 
Under The Hood of Pluggable Databases by Alex Gorbachev, Pythian, Oracle OpeW...
Under The Hood of Pluggable Databases by Alex Gorbachev, Pythian, Oracle OpeW...Under The Hood of Pluggable Databases by Alex Gorbachev, Pythian, Oracle OpeW...
Under The Hood of Pluggable Databases by Alex Gorbachev, Pythian, Oracle OpeW...Alex Gorbachev
 
SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightTillmann Eitelberg
 
MySQL 5.7: Focus on InnoDB
MySQL 5.7: Focus on InnoDBMySQL 5.7: Focus on InnoDB
MySQL 5.7: Focus on InnoDBMario Beck
 
Introduction of MariaDB 2017 09
Introduction of MariaDB 2017 09Introduction of MariaDB 2017 09
Introduction of MariaDB 2017 09GOTO Satoru
 
Stumbling stones when migrating from Oracle
 Stumbling stones when migrating from Oracle Stumbling stones when migrating from Oracle
Stumbling stones when migrating from OracleEDB
 
Introducing Infinispan
Introducing InfinispanIntroducing Infinispan
Introducing InfinispanPT.JUG
 

Was ist angesagt? (20)

Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem (c17lv version)
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem (c17lv version)Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem (c17lv version)
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem (c17lv version)
 
Oracle Database In-Memory Option for ILOUG
Oracle Database In-Memory Option for ILOUGOracle Database In-Memory Option for ILOUG
Oracle Database In-Memory Option for ILOUG
 
MySQL 5.7 New Features for Developers
MySQL 5.7 New Features for DevelopersMySQL 5.7 New Features for Developers
MySQL 5.7 New Features for Developers
 
Introduction to Oracle Data Guard Broker
Introduction to Oracle Data Guard BrokerIntroduction to Oracle Data Guard Broker
Introduction to Oracle Data Guard Broker
 
Connector/J Beyond JDBC: the X DevAPI for Java and MySQL as a Document Store
Connector/J Beyond JDBC: the X DevAPI for Java and MySQL as a Document StoreConnector/J Beyond JDBC: the X DevAPI for Java and MySQL as a Document Store
Connector/J Beyond JDBC: the X DevAPI for Java and MySQL as a Document Store
 
Introduction of MariaDB AX / TX
Introduction of MariaDB AX / TXIntroduction of MariaDB AX / TX
Introduction of MariaDB AX / TX
 
Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2
 
MariaDB: Connect Storage Engine
MariaDB: Connect Storage EngineMariaDB: Connect Storage Engine
MariaDB: Connect Storage Engine
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
Trusted advisory on technology comparison --exadata, hana, db2
Trusted advisory on technology comparison --exadata, hana, db2Trusted advisory on technology comparison --exadata, hana, db2
Trusted advisory on technology comparison --exadata, hana, db2
 
Big Data's Journey to ACID
Big Data's Journey to ACIDBig Data's Journey to ACID
Big Data's Journey to ACID
 
Hive Data Modeling and Query Optimization
Hive Data Modeling and Query OptimizationHive Data Modeling and Query Optimization
Hive Data Modeling and Query Optimization
 
Enabling real interactive BI on Hadoop
Enabling real interactive BI on HadoopEnabling real interactive BI on Hadoop
Enabling real interactive BI on Hadoop
 
Under The Hood of Pluggable Databases by Alex Gorbachev, Pythian, Oracle OpeW...
Under The Hood of Pluggable Databases by Alex Gorbachev, Pythian, Oracle OpeW...Under The Hood of Pluggable Databases by Alex Gorbachev, Pythian, Oracle OpeW...
Under The Hood of Pluggable Databases by Alex Gorbachev, Pythian, Oracle OpeW...
 
SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsight
 
MySQL 5.7: Focus on InnoDB
MySQL 5.7: Focus on InnoDBMySQL 5.7: Focus on InnoDB
MySQL 5.7: Focus on InnoDB
 
Drupal In The Cloud
Drupal In The CloudDrupal In The Cloud
Drupal In The Cloud
 
Introduction of MariaDB 2017 09
Introduction of MariaDB 2017 09Introduction of MariaDB 2017 09
Introduction of MariaDB 2017 09
 
Stumbling stones when migrating from Oracle
 Stumbling stones when migrating from Oracle Stumbling stones when migrating from Oracle
Stumbling stones when migrating from Oracle
 
Introducing Infinispan
Introducing InfinispanIntroducing Infinispan
Introducing Infinispan
 

Ähnlich wie Intro to Big Data

Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Zohar Elkayam
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvewKunal Khanna
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...Institute of Contemporary Sciences
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Andrew Brust
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...Rittman Analytics
 
Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL David Smelker
 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online TrainingLearntek1
 
Foxvalley bigdata
Foxvalley bigdataFoxvalley bigdata
Foxvalley bigdataTom Rogers
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web developmentTung Nguyen
 
Big data analysis using hadoop cluster
Big data analysis using hadoop clusterBig data analysis using hadoop cluster
Big data analysis using hadoop clusterFurqan Haider
 
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 GenoaHadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoalarsgeorge
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoopMohit Tare
 

Ähnlich wie Intro to Big Data (20)

Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvew
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
 
Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL
 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online Training
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Foxvalley bigdata
Foxvalley bigdataFoxvalley bigdata
Foxvalley bigdata
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web development
 
Big data analysis using hadoop cluster
Big data analysis using hadoop clusterBig data analysis using hadoop cluster
Big data analysis using hadoop cluster
 
Apache drill
Apache drillApache drill
Apache drill
 
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 GenoaHadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 

Mehr von Zohar Elkayam

Oracle Database Performance Tuning Advanced Features and Best Practices for DBAs
Oracle Database Performance Tuning Advanced Features and Best Practices for DBAsOracle Database Performance Tuning Advanced Features and Best Practices for DBAs
Oracle Database Performance Tuning Advanced Features and Best Practices for DBAsZohar Elkayam
 
PL/SQL New and Advanced Features for Extreme Performance
PL/SQL New and Advanced Features for Extreme PerformancePL/SQL New and Advanced Features for Extreme Performance
PL/SQL New and Advanced Features for Extreme PerformanceZohar Elkayam
 
The art of querying – newest and advanced SQL techniques
The art of querying – newest and advanced SQL techniquesThe art of querying – newest and advanced SQL techniques
The art of querying – newest and advanced SQL techniquesZohar Elkayam
 
Oracle Advanced SQL and Analytic Functions
Oracle Advanced SQL and Analytic FunctionsOracle Advanced SQL and Analytic Functions
Oracle Advanced SQL and Analytic FunctionsZohar Elkayam
 
Oracle 12c New Features For Better Performance
Oracle 12c New Features For Better PerformanceOracle 12c New Features For Better Performance
Oracle 12c New Features For Better PerformanceZohar Elkayam
 
Advanced PL/SQL Optimizing for Better Performance 2016
Advanced PL/SQL Optimizing for Better Performance 2016Advanced PL/SQL Optimizing for Better Performance 2016
Advanced PL/SQL Optimizing for Better Performance 2016Zohar Elkayam
 
Oracle Database Advanced Querying (2016)
Oracle Database Advanced Querying (2016)Oracle Database Advanced Querying (2016)
Oracle Database Advanced Querying (2016)Zohar Elkayam
 
OOW2016: Exploring Advanced SQL Techniques Using Analytic Functions
OOW2016: Exploring Advanced SQL Techniques Using Analytic FunctionsOOW2016: Exploring Advanced SQL Techniques Using Analytic Functions
OOW2016: Exploring Advanced SQL Techniques Using Analytic FunctionsZohar Elkayam
 
Is SQLcl the Next Generation of SQL*Plus?
Is SQLcl the Next Generation of SQL*Plus?Is SQLcl the Next Generation of SQL*Plus?
Is SQLcl the Next Generation of SQL*Plus?Zohar Elkayam
 
Exploring Advanced SQL Techniques Using Analytic Functions
Exploring Advanced SQL Techniques Using Analytic FunctionsExploring Advanced SQL Techniques Using Analytic Functions
Exploring Advanced SQL Techniques Using Analytic FunctionsZohar Elkayam
 
Exploring Advanced SQL Techniques Using Analytic Functions
Exploring Advanced SQL Techniques Using Analytic FunctionsExploring Advanced SQL Techniques Using Analytic Functions
Exploring Advanced SQL Techniques Using Analytic FunctionsZohar Elkayam
 
Advanced PLSQL Optimizing for Better Performance
Advanced PLSQL Optimizing for Better PerformanceAdvanced PLSQL Optimizing for Better Performance
Advanced PLSQL Optimizing for Better PerformanceZohar Elkayam
 
Oracle Database Advanced Querying
Oracle Database Advanced QueryingOracle Database Advanced Querying
Oracle Database Advanced QueryingZohar Elkayam
 
SQLcl the next generation of SQLPlus?
SQLcl the next generation of SQLPlus?SQLcl the next generation of SQLPlus?
SQLcl the next generation of SQLPlus?Zohar Elkayam
 
Oracle Data Guard A to Z
Oracle Data Guard A to ZOracle Data Guard A to Z
Oracle Data Guard A to ZZohar Elkayam
 
Oracle Data Guard Broker Webinar
Oracle Data Guard Broker WebinarOracle Data Guard Broker Webinar
Oracle Data Guard Broker WebinarZohar Elkayam
 

Mehr von Zohar Elkayam (16)

Oracle Database Performance Tuning Advanced Features and Best Practices for DBAs
Oracle Database Performance Tuning Advanced Features and Best Practices for DBAsOracle Database Performance Tuning Advanced Features and Best Practices for DBAs
Oracle Database Performance Tuning Advanced Features and Best Practices for DBAs
 
PL/SQL New and Advanced Features for Extreme Performance
PL/SQL New and Advanced Features for Extreme PerformancePL/SQL New and Advanced Features for Extreme Performance
PL/SQL New and Advanced Features for Extreme Performance
 
The art of querying – newest and advanced SQL techniques
The art of querying – newest and advanced SQL techniquesThe art of querying – newest and advanced SQL techniques
The art of querying – newest and advanced SQL techniques
 
Oracle Advanced SQL and Analytic Functions
Oracle Advanced SQL and Analytic FunctionsOracle Advanced SQL and Analytic Functions
Oracle Advanced SQL and Analytic Functions
 
Oracle 12c New Features For Better Performance
Oracle 12c New Features For Better PerformanceOracle 12c New Features For Better Performance
Oracle 12c New Features For Better Performance
 
Advanced PL/SQL Optimizing for Better Performance 2016
Advanced PL/SQL Optimizing for Better Performance 2016Advanced PL/SQL Optimizing for Better Performance 2016
Advanced PL/SQL Optimizing for Better Performance 2016
 
Oracle Database Advanced Querying (2016)
Oracle Database Advanced Querying (2016)Oracle Database Advanced Querying (2016)
Oracle Database Advanced Querying (2016)
 
OOW2016: Exploring Advanced SQL Techniques Using Analytic Functions
OOW2016: Exploring Advanced SQL Techniques Using Analytic FunctionsOOW2016: Exploring Advanced SQL Techniques Using Analytic Functions
OOW2016: Exploring Advanced SQL Techniques Using Analytic Functions
 
Is SQLcl the Next Generation of SQL*Plus?
Is SQLcl the Next Generation of SQL*Plus?Is SQLcl the Next Generation of SQL*Plus?
Is SQLcl the Next Generation of SQL*Plus?
 
Exploring Advanced SQL Techniques Using Analytic Functions
Exploring Advanced SQL Techniques Using Analytic FunctionsExploring Advanced SQL Techniques Using Analytic Functions
Exploring Advanced SQL Techniques Using Analytic Functions
 
Exploring Advanced SQL Techniques Using Analytic Functions
Exploring Advanced SQL Techniques Using Analytic FunctionsExploring Advanced SQL Techniques Using Analytic Functions
Exploring Advanced SQL Techniques Using Analytic Functions
 
Advanced PLSQL Optimizing for Better Performance
Advanced PLSQL Optimizing for Better PerformanceAdvanced PLSQL Optimizing for Better Performance
Advanced PLSQL Optimizing for Better Performance
 
Oracle Database Advanced Querying
Oracle Database Advanced QueryingOracle Database Advanced Querying
Oracle Database Advanced Querying
 
SQLcl the next generation of SQLPlus?
SQLcl the next generation of SQLPlus?SQLcl the next generation of SQLPlus?
SQLcl the next generation of SQLPlus?
 
Oracle Data Guard A to Z
Oracle Data Guard A to ZOracle Data Guard A to Z
Oracle Data Guard A to Z
 
Oracle Data Guard Broker Webinar
Oracle Data Guard Broker WebinarOracle Data Guard Broker Webinar
Oracle Data Guard Broker Webinar
 

Kürzlich hochgeladen

Visualising and forecasting stocks using Dash
Visualising and forecasting stocks using DashVisualising and forecasting stocks using Dash
Visualising and forecasting stocks using Dashnarutouzumaki53779
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate AgentsRyan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate AgentsRyan Mahoney
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 

Kürzlich hochgeladen (20)

Visualising and forecasting stocks using Dash
Visualising and forecasting stocks using DashVisualising and forecasting stocks using Dash
Visualising and forecasting stocks using Dash
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate AgentsRyan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 

Intro to Big Data

  • 1. Zohar Elkayam CTO, Brillix Zohar@Brillix.co.il Twitter: @realmgic Introduction to Big Data
  • 2. Agenda • What is big Data and the 3-Vs • Introduction to Hadoop • Who Handles Big Data and Data Science • NoSQL http://brillix.co.il2
  • 3. Who am I? • Zohar Elkayam, CTO at Brillix • Oracle ACE Associate • DBA, team leader, instructor and senior consultant for over 16 years • Editor (and manager) of ilDBA – Israel Database Community • Blogger – www.realdbamagic.com http://brillix.co.il3
  • 4. What is Big Data? http://brillix.co.il4
  • 6.
  • 7. So, What is Big Data? • When the data is too big or moves too fast to handle in a sensible amount of time. • When the data doesn’t fit conventional database structure. • What the solution becomes part of the problem.
  • 8. Big Problems with Big Data • Unstructured • Unprocessed • Un-aggregated • Un-filtered • Repetitive • Low quality • And generally messy • Oh, and there is a lot of it
  • 10. MEDIA/ ENTERTAINMENT Viewers / advertising effectiveness COMMUNICATIONS Location-based advertising EDUCATION & RESEARCH Experiment sensor analysis CONSUMER PACKAGED GOODS Sentiment analysis of what’s hot, problems HEALTH CARE Patient sensors, monitoring, EHRs Quality of care LIFE SCIENCES Clinical trials Genomics HIGH TECHNOLOGY / INDUSTRIAL MFG. Mfg quality Warranty analysis OIL & GAS Drilling exploration sensor analysis FINANCIAL SERVICES Risk & portfolio analysis New products AUTOMOTIVE Auto sensors reporting location, problems RETAIL Consumer sentiment Optimized marketing LAW ENFORCEMENT & DEFENSE Threat analysis - social media monitoring, photo analysis TRAVEL & TRANSPORTATION Sensor analysis for optimal traffic flows Customer sentiment UTILITIES Smart Meter analysis for network capacity, Sample of Big Data Use Cases Today ON-LINE SERVICES / SOCIAL MEDIA People & career matching Web-site optimization
  • 11. Most Requested Uses of Big Data • Log Analytics & Storage • Smart Grid / Smarter Utilities • RFID Tracking & Analytics • Fraud / Risk Management & Modeling • 360° View of the Customer • Warehouse Extension • Email / Call Center Transcript Analysis • Call Detail Record Analysis
  • 13. The Big Data Challenge (3V)
  • 14. Big Data: Challenge to Value Business Value  High Variety  High Volume  High Velocity Today  Deep Analytics  High Agility  Massive Scalability  Real Time Tomorrow Challenges
  • 15. Volume • Big data comes in one size: big. Size is measured in terabytes, petabytes, and even exabytes and zettabytes. • Storing and handling the data becomes an issue. • Producing value out of the data in a reasonable time is also an issue.
  • 16. Velocity • The speed at which the data is being generated and collected. • Streaming data and large-volume data movement. • High velocity of data capture requires rapid ingestion. • What happens on downtime (the backlog problem).
  • 17. Variety • Big Data extends beyond structured data, including semi-structured and unstructured information: logs, text, audio, and video. • A wide variety of rapidly evolving data types requires highly flexible stores and handling.
  • 18. Big Data is ANY data: unstructured, semi-structured, and structured • Some has a fixed structure • Some is “bring your own structure” • We want to find value in all of it
  • 19. Structured & Un-Structured
    Un-Structured | Structured
    Objects | Tables
    Flexible | Columns and Rows
    Structure Unknown | Predefined Structure
    Textual and Binary | Mostly Textual
  • 22. Big Data in Practice • Big data is big: technological infrastructure solutions needed. • Big data is messy: data sources must be cleaned before use. • Big data is complicated: need developers and system admins to manage intake of data.
  • 23. Big Data in Practice (cont.) • Data must be broken out of silos in order to be mined, analyzed and transformed into value. • The organization must learn how to communicate and interpret the results of analysis.
  • 24. Infrastructure Challenges • Infrastructure that is built for: • Large-scale • Distributed • Data-intensive jobs that spread the problem across clusters of server nodes
  • 25. Infrastructure Challenges – Cont. • Storage: • Efficient and cost-effective enough to capture and store terabytes, if not petabytes, of data • With intelligent capabilities to reduce your data footprint such as: • Data compression • Automatic data tiering • Data deduplication
  • 26. Infrastructure Challenges – Cont. • Network infrastructure that can quickly import large data sets and then replicate them to various nodes for processing • Security capabilities that protect a highly distributed infrastructure and its data
  • 28. Apache Hadoop • Open source project run by Apache (since 2006). • Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure. • Apache Hadoop has been the driving force behind the growth of the big data industry.
  • 30. Key points • An open-source framework that uses a simple programming model to enable distributed processing of large data sets on clusters of computers. • The complete technology stack includes: common utilities, a distributed file system, analytics and data storage platforms, and an application layer that manages distributed processing, parallel computation, workflow, and configuration management • More cost-effective for handling large unstructured data sets than conventional approaches, and it offers massive scalability and speed
  • 31. Why use Hadoop? • Scalability: near-linear performance up to 1000s of nodes • Cost: leverages commodity HW & open source SW • Flexibility: versatility with data, analytics & operations
  • 32. Really, Why use Hadoop? • Need to process multi-petabyte datasets • Expensive to build reliability into each application • Nodes fail every day: failure is expected, rather than exceptional, and the number of nodes in a cluster is not constant • Need a common infrastructure: efficient, reliable, open source (Apache License) • The above goals are the same as Condor’s, but workloads are IO bound, not CPU bound
  • 33. Hadoop Benefits • Reliable solution based on unreliable hardware • Designed for large files • Load data first, structure later • Designed to maximize throughput of large scans • Designed to leverage parallelism • Designed to scale • Flexible development platform • Solution Ecosystem
  • 34. Hadoop Limitations • Hadoop is scalable but not fast • Some assembly required • Batteries not included • Instrumentation not included either • DIY mindset (remember Linux/MySQL?) • On the larger scale – Hadoop is not cheap (but still cheaper than using old solutions)
  • 35. Example Comparison: RDBMS vs. Hadoop
    | Traditional RDBMS | Hadoop
    Data Size | Gigabytes | Petabytes
    Access | Interactive and Batch | Batch – NOT Interactive
    Updates | Read / Write many times | Write once, Read many times
    Structure | Static Schema | Dynamic Schema
    Scaling | Nonlinear | Linear
    Query Response Time | Can be near immediate | Has latency (due to batch processing)
  • 36. Hadoop and Relational Database – best when used together • Relational database best used for: interactive OLAP analytics (<1 sec), multistep transactions, 100% SQL compliance • Hadoop best used for: structured or not (flexibility), scalability of storage/compute, complex data processing, cheaper compared to RDBMS
  • 38. Hadoop Main Components • HDFS: Hadoop Distributed File System – a distributed file system that runs in a clustered environment. • MapReduce – a programming paradigm for running processes over clustered environments.
  • 39. HDFS is... • A distributed file system • Redundant storage • Designed to reliably store data using commodity hardware • Designed to expect hardware failures • Intended for large files • Designed for batch inserts
  • 40. HDFS Node Types HDFS has three types of nodes: • Namenode (master node): distributes files in the cluster; responsible for replication between the datanodes and for file block locations • Datanodes: responsible for the actual file storage; serve data from files to clients • BackupNode (version 0.23 and up): a backup of the NameNode
  • 41. Typical implementation • Nodes are commodity PCs • 30-40 nodes per rack • Uplink from racks is 3-4 gigabit • Rack-internal is 1 gigabit
  • 42. MapReduce is... • A programming model for expressing distributed computations at a massive scale • An execution framework for organizing and performing such computations • An open-source implementation called Hadoop
  • 43. MapReduce Example:
    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
      -input myInputDirs \
      -output myOutputDir \
      -mapper /bin/cat \
      -reducer /bin/wc
  • Runs programs (jobs) across many computers • Protects against single-server failure by re-running failed steps • MR jobs can be written in Java, C, Python, Ruby, etc. • Users only write Map and Reduce functions • MAP – takes a large problem and divides it into sub-problems; performs the same function on all sub-problems • REDUCE – combines the output from all sub-problems
  • 44. Typical large-data problem • Iterate over a large number of records • Extract something of interest from each • Shuffle and sort intermediate results • Aggregate intermediate results • Generate final output • Map covers the iterate/extract steps; Reduce covers the aggregate/generate steps (Dean and Ghemawat, OSDI 2004)
  • 45. MapReduce paradigm • Implement two functions: • Map(k1, v1) -> list(k2, v2) • Reduce(k2, list(v2)) -> list(v3) • The framework handles everything else* • Values with the same key go to the same reducer
  • 47. MapReduce – word count example
    function map(String name, String document):
      for each word w in document:
        emit(w, 1)
    function reduce(String word, Iterator partialCounts):
      totalCount = 0
      for each count in partialCounts:
        totalCount += count
      emit(word, totalCount)
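  As a runnable sketch, the same word count can be written as a pair of Hadoop Streaming scripts in Python; the file names mapper.py and reducer.py are illustrative, not from the original deck:
    # mapper.py – emit (word, 1) for every word read from stdin
    import sys
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

    # reducer.py – sum counts per word; Hadoop delivers mapper output
    # sorted by key, so equal words arrive consecutively
    import sys
    current_word, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, total))
            current_word, total = word, 0
        total += int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, total))
  These would be submitted with the same hadoop-streaming.jar invocation shown above, passing the two scripts as the -mapper and -reducer arguments.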
  • 48. MapReduce Word Count Process http://brillix.co.il48
  • 49. MapReduce is good for... • Embarrassingly parallel algorithms • Summing, grouping, filtering, joining • Off-line batch jobs on massive data sets • Analyzing an entire large dataset
  • 50. MapReduce is ok for... • Iterative jobs (e.g., graph algorithms) • Each iteration must read/write data to disk • The IO and latency cost of an iteration is high
  • 51. MapReduce is NOT good for... • Jobs that need shared state/coordination • Tasks are shared-nothing • Shared state requires a scalable state store • Low-latency jobs • Jobs on small datasets • Finding individual records
  • 53. Improving Hadoop Core Hadoop is complicated, so tools were created to make things easier. Improving programmability: • Pig: a programming language that simplifies Hadoop actions: loading, transforming and sorting data • Hive: enables Hadoop to operate as a data warehouse using SQL-like syntax
  • 54. Pig • Data flow processing • Uses the Pig Latin query language • Highly parallel, in order to distribute data processing across many servers • Combines multiple data sources (files, HBase, Hive)
  • 55. Hive • Built on the MapReduce framework, so it generates MR jobs behind the scenes • Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for large datasets stored in HDFS/HBase • Has partitioning and partition swapping • Good for random sampling • Example:
    CREATE EXTERNAL TABLE vs_hdfs (
      site_id string, session_id string, time_stamp bigint, visitor_id bigint,
      row_unit string, evts string, biz string, plne string, dims string)
    PARTITIONED BY (site string, day string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '001'
    STORED AS SEQUENCEFILE LOCATION '/home/data/';

    SELECT session_id,
           get_json_object(concat(tttt, "}"), '$.BY'),
           get_json_object(concat(tttt, "}"), '$.TEXT')
    FROM (
      SELECT session_id, concat("{", regexp_replace(event, "[{|}]", ""), "}") tttt
      FROM (
        SELECT session_id, get_json_object(plne, '$.PLine.evts[*]') pln
        FROM vs_hdfs_v1
        WHERE site='6964264' AND day='20120201' AND plne!='{}' LIMIT 10
      ) t LATERAL VIEW explode(split(pln, "},{")) adTable AS event
    ) t2
  • 57. Improving Hadoop (cont.) For improving access: • HBase: a column-oriented database that runs on HDFS. • Sqoop: a tool designed to import data from relational databases into HDFS or Hive.
  • 58. HBase What is HBase and why should you use it? • Huge volumes of randomly accessed data. • There are no restrictions on the number of columns per row – it’s dynamic. • Consider HBase when you’re loading data by key, searching data by key (or range), serving data by key, querying data by key, or when storing data by row that doesn’t conform well to a schema. HBase don’ts: • It doesn’t talk SQL, have an optimizer, or support transactions or joins. If you don’t use any of these in your database application, then HBase could very well be the perfect fit. Example:
    create 'blogposts', 'post', 'image'                        # create table
    put 'blogposts', 'id1', 'post:title', 'Hello World'        # insert value
    put 'blogposts', 'id1', 'post:body', 'This is a blog post' # insert value
    put 'blogposts', 'id1', 'image:header', 'image1.jpg'       # insert value
    get 'blogposts', 'id1'                                     # select record
  • 59. Sqoop What is Sqoop? • It’s a command-line tool for moving data between HDFS and relational database systems. • You can download drivers for Sqoop from Microsoft to: • Import data/query results from SQL Server to Hadoop • Export data from Hadoop to SQL Server • It’s like BCP • Example:
    $ bin/sqoop import --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' --table lineitem --hive-import
    $ bin/sqoop export --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' --table lineitem --export-dir /data/lineitemData
  • 60. Improving Hadoop (cont.) • For improving coordination: Zookeeper • For improving scheduling/orchestration: Oozie • For improving UI: Hue • For machine learning: Mahout
  • 63. Hadoop cluster: a cluster of machines running Hadoop at Yahoo! (credit: Yahoo!)
  • 64. Hadoop In The Real World http://brillix.co.il64
  • 66. Big Data Market Survey • 3 major groups for rolling your own Big Data: • Integrated Hadoop providers. • Analytical database with Hadoop connectivity. • Hadoop-centered companies. • Big Data on the Cloud.
  • 67. Integrated Hadoop Providers – IBM InfoSphere
    Database: DB2
    Deployment options: Software (Enterprise Linux), Cloud
    Hadoop: Bundled distribution (InfoSphere BigInsights); Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Lucene
    NoSQL: HBase
  • 68. Integrated Hadoop Providers – Microsoft
    Database: SQL Server
    Deployment options: Software (Windows Server), Cloud (Windows Azure Cloud)
    Hadoop: Bundled distribution (Big Data Solution); Hive, Pig
    NoSQL: None
  • 69. Integrated Hadoop Providers – Oracle
    Database: None
    Deployment options: Appliance (Oracle Big Data Appliance)
    Hadoop: Bundled distribution (Cloudera’s Distribution including Apache Hadoop); Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Sqoop, Mahout, Whirr
    NoSQL: Oracle NoSQL Database
  • 70. Integrated Hadoop Providers – Pivotal Greenplum
    Database: Greenplum Database
    Deployment options: Appliance (Modular Data Computing appliance), Software (Enterprise Linux), Cloud (Cloud Foundry)
    Hadoop: Bundled distribution (Pivotal HD); Hive, Pig, Zookeeper, HBase
    NoSQL: HBase
  • 71. Hadoop-Centered Companies • Cloudera: the longest-established Hadoop distribution • Hortonworks: a major contributor to the Hadoop code and core components • MapR
  • 72. Big Data and Cloud • Some Big Data solutions can be provided using IaaS: Infrastructure as a Service. • Private clouds can be constructed using Hadoop orchestration tools. • Public clouds provided by Rackspace or Amazon EC2 can be used to start a Hadoop cluster.
  • 73. Big Data and Cloud (cont.) • PaaS: Platform as a Service can be used to remove the need to configure or scale things. • The major PaaS Providers are Amazon, Google and Microsoft.
  • 74. PaaS Services: Amazon • Amazon: • Elastic Map Reduce (EMR): MapReduce programs submitted to a cluster managed by Amazon. Good for EC2/S3 combinations. • DynamoDB: NoSQL database provided by Amazon to replace HBase.
  • 75. PaaS Services: Google • Google: • BigQuery: analytical database suitable for interactive analysis over datasets on the order of 1 TB. • Prediction API: machine learning platform for classification and sentiment analysis done with Google’s tools on customer data.
  • 76. PaaS Services: Microsoft • Microsoft: • Windows Azure: a cloud computing platform and infrastructure that can be used as PaaS and as IaaS.
  • 77. Who Handles Big Data … and how? http://brillix.co.il77
  • 78. Big Data Readiness • The R&D Prototype Stage • Skills needed: • Distributed data deployment (e.g. Hadoop) • Python or Java programming with MapReduce • Statistical analysis (e.g. R) • Data integration • Ability to formulate business hypotheses • Ability to convey business value of Big Data
  • 79. Data Science • A discipline that combines math, statistics, programming, and scientific instinct with the goal of extracting meaning from data. • Data scientists combine technical expertise, curiosity, storytelling and cleverness to find and deliver the signal in the noise.
  • 80. The Rise of the Data Scientist • Data scientists are responsible for • modeling complex business problems • discovering business insights • identifying opportunities. • Demand is high for people who can help make sense of the massive streams of digital information pouring into organizations
  • 81. New Roles and Skills • Big Data Scientist: industry expertise, analytics skills • Big Data Engineers: Hadoop/Java, non-relational DBs • Agility and focus on value
  • 83. Predictive Analytics • Predictive analytics looks into the future to provide insight into what will happen and includes what-if scenarios and risk assessment. It can be used for • Forecasting • hypothesis testing • risk modeling • propensity modeling
  • 84. Prescriptive analytics • Prescriptive analytics is focused on understanding what would happen based on different alternatives and scenarios, and then choosing best options, and optimizing what’s ahead. Use cases include • Customer cross-channel optimization • best-action-related offers • portfolio and business optimization • risk management
  • 85. How Predictive Analytics Works • Traditional BI tools use a deductive approach to data, which assumes some understanding of existing patterns and relationships. • An analytics model approaches the data based on this knowledge. • For obvious reasons, deductive methods work well with structured data
  • 86. Inductive approach • An inductive approach makes no presumptions of patterns or relationships and is more about data discovery. Predictive analytics applies inductive reasoning to big data using sophisticated quantitative methods such as • machine learning • neural networks • robotics • computational mathematics • artificial intelligence • The goal: explore all the data and discover interrelationships and patterns
  • 87. Inductive approach – Cont. • Inductive methods use algorithms to perform complex calculations specifically designed to run against highly varied or large volumes of data • The result of applying these techniques to a real-world business problem is a predictive model • The ability to know what algorithms and data to use to test and create the predictive model is part of the science and art of predictive analytics
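  As a toy sketch of the inductive approach – this uses scikit-learn with made-up data, purely for illustration – a model learns a pattern from examples rather than from predefined rules:
    # a minimal inductive/predictive sketch with scikit-learn (made-up data)
    from sklearn.linear_model import LogisticRegression

    X = [[0], [1], [2], [3], [4], [5]]      # feature rows (e.g., a customer attribute)
    y = [0, 0, 0, 1, 1, 1]                  # observed outcomes to learn from
    model = LogisticRegression().fit(X, y)  # discover the pattern, no preset rules
    print(model.predict([[4.5]]))           # score a new, unseen case -> [1]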
  • 88. Share Nothing vs. Share Everything
    Share nothing | Share everything
    Many processing engines | Many servers
    Data is spread on many nodes | Data is located on a single storage
    Joins are problematic | Efficient joins
    Very scalable | Limited scalability
  • 89. Big Data and NoSQL http://brillix.co.il89
  • 90. The Challenge • We want scalable, durable, high-volume, high-velocity, distributed data storage that can handle non-structured data and that will fit our specific need • RDBMS is too generic and doesn’t cut it any more – it can do the job, but it is not cost-effective for our usages
  • 91. The Solution: NoSQL • Let’s take some parts of the standard RDBMS out and design the solution for our specific uses • NoSQL databases have been around for ages under different names/solutions
  • 92. The NOSQL Movement • NOSQL is not a technology – it’s a concept. • We need high performance, scale out abilities or an agile structure. • We are now willing to sacrifice our sacred cows: consistency, transactions. • Over 150 different brands and solutions (http://nosql-database.org/).
  • 93. NoSQL or NOSQL • NoSQL is not No to SQL • NoSQL is not Never SQL • NOSQL = Not Only SQL
  • 94. Why NoSQL? • Some applications need very few database features, but need high scale. • Desire to avoid data/schema pre-design altogether for simple applications. • Need for a low-latency, low-overhead API to access data. • Simplicity -- do not need fancy indexing – just fast lookup by primary key.
  • 95. Why NoSQL? (cont.) • Developer friendly, DBAs not needed (?). • Schema-less. • Agile: non-structured (or semi-structured). • In Memory. • No (or loose) Transactions. • No joins.
  • 97. Is NoSQL an RDBMS Replacement? NO. Well... sometimes it is…
  • 98. RDBMS vs. NoSQL Rationale for choosing a persistent store:
    Relational Architecture | NoSQL Architecture
    High value, high density, complex data | Low value, low density, simple data
    Complex data relationships | Very simple relationships
    Schema-centric | Schema-free, unstructured or semi-structured data
    Designed to scale up & out | Distributed storage and processing
    Lots of general-purpose features/functionality | Stripped down, special-purpose data store
    High overhead ($ per operation) | Low overhead ($ per operation)
  • 100. Scalability • NoSQL is sometimes very easy to scale out • Most have dynamic data partitioning and easy data distribution • But distributed systems always come with a price: the CAP theorem and its impact on ACID transactions
  • 101. ACID Transactions Most DBMSs are built with ACID transactions in mind: • Atomicity: all or nothing; performs write operations as a single transaction • Consistency: any transaction will take the DB from one consistent state to another with no broken constraints; ensures replicas are identical on different nodes • Isolation: other operations cannot access data that has been modified during a transaction that has not yet completed • Durability: ability to recover the committed transaction updates against any kind of system failure (transaction log)
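  As a small illustration of atomicity – a sketch using Python’s built-in sqlite3 module with a made-up transfer example – both writes commit together or not at all:
    # a tiny all-or-nothing transaction with sqlite3 (standard library)
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE accounts (name TEXT, balance INT)")
    con.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
    with con:  # one transaction: both updates commit, or neither does
        con.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'a'")
        con.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'b'")
    # had either UPDATE raised an error, the whole transfer would roll back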
  • 102. ACID Transactions (cont.) • ACID is usually implemented by a locking mechanism/manager • In distributed systems, central locking can become a bottleneck • Most NoSQL stores do not use (or limit) ACID transactions and replace them with something else…
  • 103. CAP Theorem • The CAP theorem states that in a distributed/partitioned application, you can only pick two of the following three characteristics: • Consistency. • Availability. • Partition Tolerance.
  • 105. NoSQL BASE • NoSQL usually provides BASE characteristics instead of ACID. BASE stands for: • Basically Available • Soft State • Eventual Consistency • It means that when an update is made in one place, the other partitions will see it over time – there might be an inconsistency window • Read and write operations complete more quickly, lowering latency
  • 108. NoSQL Taxonomy • Key-Value Store • Document Store • Column Store • Graph Store (examples of each follow)
  • 110. Key Value Store • Distributed hash tables. • Very fast to get a single value. • Examples: • Amazon DynamoDB • Berkeley DB • Redis • Riak • Cassandra
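  A minimal key-value usage sketch, assuming a local Redis server and the redis-py client (the key names are made up):
    # key-value store: write and read a value by primary key
    import redis

    r = redis.Redis(host="localhost", port=6379)
    r.set("user:42:name", "Zohar")       # store a value under a key
    print(r.get("user:42:name"))         # fast lookup by key -> b'Zohar'
    r.expire("user:42:name", 3600)       # optional TTL, a common KV feature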
  • 111. Document Store • Similar to Key/Value, but value is a document. • JSON or something similar, flexible schema • Agile technology. • Examples: • MongoDB • CouchDB • CouchBase
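  A minimal document-store sketch, assuming a local MongoDB instance and the pymongo driver (database, collection, and field names are made up):
    # document store: schema-flexible JSON-like documents
    from pymongo import MongoClient

    db = MongoClient("localhost", 27017).blog
    db.posts.insert_one({"title": "Hello World", "tags": ["intro", "bigdata"]})
    print(db.posts.find_one({"tags": "intro"}))  # query by any field, no pre-designed schema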
  • 112. Column Store • One key, multiple attributes. • Hybrid row/column. • Examples: • Google BigTable • Hbase • Amazon’s SimpleDB • Cassandra
  • 113. How Are Records Organized? • This is a logical table in RDBMS systems • Its physical organization is just like the logical one: column by column, row by row (figure: a grid of rows 1–4 by columns 1–4) http://brillix.co.il113
  • 114. Query Data • When we query data, records are read in the order they are organized in the physical structure • Even when we query a single column (Select Col2 From MyTable), we still need to read the entire table and extract the column, just as with Select * From MyTable http://brillix.co.il114
  • 115. How Does a Column Store Save Data? (figures: organization in a row store vs. organization in a column store) http://brillix.co.il116
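  A toy sketch of the difference in plain Python – illustrative only, not how a real engine is implemented:
    # the same records laid out row-wise vs. column-wise
    rows = [(1, "Alice", 30), (2, "Bob", 25), (3, "Carol", 41)]

    # row store: reading one column still touches every full record
    ages_from_rows = [row[2] for row in rows]

    # column store: each column is contiguous, so one column is read alone
    columns = {"id": [1, 2, 3], "name": ["Alice", "Bob", "Carol"], "age": [30, 25, 41]}
    ages_from_columns = columns["age"]   # no need to scan id/name data at all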
  • 116. Graph Store • Inspired by graph theory. • Data model: nodes, relationships, and properties on both. • Relational databases have a very hard time representing a graph. • Examples: • Neo4j • InfiniteGraph • RDF
  • 117. What is a Graph? • An abstract representation of a set of objects where some pairs are connected by links. • Object (vertex, node): can have attributes like name and value • Link (edge, arc, relationship): can have attributes like type and name or date
  • 118. Graph Types • Undirected Graph • Directed Graph • Pseudo Graph • Multi Graph
  • 119. More Graph Types • Weighted Graph: edges carry a numeric weight (e.g., 10) • Labeled Graph: edges carry a label (e.g., “Like”) • Property Graph: nodes and edges carry properties (e.g., an edge friend, date: 2013 between nodes Name: yosi, Age: 40 and Name: ami, Age: 30)
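  A toy property graph in plain Python dicts, reusing the slide’s yosi/ami example – an illustration only, not a graph database:
    # nodes and edges both carry properties
    nodes = {
        "n1": {"Name": "yosi", "Age": 40},
        "n2": {"Name": "ami", "Age": 30},
    }
    edges = [
        {"from": "n1", "to": "n2", "type": "friend", "date": 2013},
    ]

    # traverse: who are yosi's friends?
    friends = [nodes[e["to"]]["Name"] for e in edges
               if e["from"] == "n1" and e["type"] == "friend"]
    print(friends)  # ['ami']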
  • 122. Conclusion • Big Data is one of the hottest buzzwords of the last few years – we should all know what it’s all about • DBAs are often called upon for big data problems – today DBAs need to know what to ask in order to provide good solutions, even if it’s not a database-related issue • NoSQL doesn’t have to be a Big Data solution, but Big Data often uses NoSQL solutions http://brillix.co.il123