Big Data

Big Data: hype or necessity?
Dr. ir. ing. Bart Vandewoestyne
Sizing Servers Lab, Howest, Kortrijk

February 18, 2014

Outline

1 Introduction
    Big Data?
2 Big Data Technology
    Hadoop
    Pig, Hive
    NoSQL
3 Big Data in my company?
4 Conclusions

Introduction
Big Data?

Exponential growth of data

[Figure: "Big Data: This is just the beginning". Data volume in exabytes and percentage of uncertain data, 2010 to 2015, for Enterprise Data, Sensors & Devices, VoIP, and Social Media, with a "You are here" marker. © 2013 International Business Machines Corporation]


Examples
Facebook hosts ≈ 10 billion photos ≈ 1 petabyte

Large Hadron Collider: will produce ≈ 15 petabytes per year

RFID readers

vehicle GPS traces

Smart energy meters


Examples from Flanders
Myriade

Be-Mobile / Touring Mobilis

Colruyt

avoid traffic jams
find optimal planning

Big Data definition

Definition of Big Data depends on who you ask:
Big Data
“Multiple terabytes or petabytes.”
(according to some professionals)
“I don’t know.”
(today’s big may be tomorrow’s normal)
“Relative to its context.”


Quotes on Big Data

“Big data” is a subjective label attached to situations in
which human and technical infrastructures are unable to
keep pace with a company’s data needs.
It’s about recognizing that for some problems other
storage solutions are better suited.


The Three V’s

Volume: The amount of data is big.
Variety: Different kinds of data:
    structured
    semi-structured
    unstructured
Velocity: Speed issues to consider:
    How fast is the data available for analysis?
    How fast can we do something with it?
Other V's: Veracity, Variability, Validity, Value, ...


Structured data
Pre-defined schema imposed on the data
Highly structured
Usually stored in a relational database system
Example
    numbers: 20, 3.1415, ...
    strings: "Hello World"
    dates: 21/03/1978
    ...

Roughly 20% of all data out there is structured.


Semi-structured data

Inconsistent structure.
Cannot be stored in rows and tables in a typical database.
Information is often self-describing (label/value pairs).
Example
    XML, SGML, ...
    tweets
    BibTeX files
    sensor feeds
    logs
    ...


Semi-structured data: examples

Example
<?xml version="1.0"?>
<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
    <price>44.95</price>
  </book>
</catalog>
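
Because every value carries its own label, a program can recover the label/value pairs without a predefined schema. The following is a minimal sketch using Python's standard-library `xml.etree.ElementTree`; the helper name `book_to_pairs` is an illustrative assumption, not part of the original slides.

```python
import xml.etree.ElementTree as ET

# The catalog document from the example above.
CATALOG_XML = """<?xml version="1.0"?>
<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
    <price>44.95</price>
  </book>
</catalog>"""


def book_to_pairs(book):
    """Turn one <book> element into a dict of label/value pairs."""
    pairs = {"id": book.get("id")}
    for child in book:
        # The tag name is the label, the element text is the value.
        pairs[child.tag] = child.text
    return pairs


root = ET.fromstring(CATALOG_XML)
books = [book_to_pairs(book) for book in root.findall("book")]
print(books[0]["title"], books[0]["price"])
```

Note that all values come back as strings; nothing in the document itself says that `price` is a number, which is part of what makes the data only semi-structured.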

Example
@article{vandewoestyne2007cools_b,
  author  = {Vandewoestyne, Bart and Cools, Ronald},
  title   = {On obtaining higher order convergence for
             smooth periodic functions},
  journal = {Journal of Complexity},
  year    = {2008},
  volume  = {24},
  number  = {3},
  pages   = {328--340},
  month   = jun
}


Unstructured data
Definition (Unstructured data)
Lacks structure or parts of it lack structure.
Example
    multimedia: videos, photos, audio files, ...
    word processing documents
    email messages
    reports
    free-form text
    presentations
    ...

Experts estimate that 80 to 90% of the data in any
organization is unstructured.

Data Storage and Analysis
Storage capacity of hard drives has increased massively over
the years.
Access speeds have not kept up.
Example (Reading a whole disk)
Year    Storage Capacity    Transfer Speed    Time
1990    1370 MB             4.4 MB/s          ≈ 5 minutes
2010    1 TB                100 MB/s          > 2.5 hours

Solution: work in parallel!
Using 100 drives (each holding 1/100th of the data),
reading 1 TB takes less than 2 minutes.
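
The numbers above follow from dividing capacity by transfer speed. A small Python sketch of that arithmetic, assuming decimal units (1 TB = 1,000,000 MB) and an even split of the data over the drives; the function name is illustrative only.

```python
def read_time_seconds(capacity_mb, speed_mb_per_s, drives=1):
    """Time to read everything when the data is spread evenly over
    `drives` disks that are read in parallel."""
    return capacity_mb / drives / speed_mb_per_s


MB_PER_TB = 1_000_000  # decimal units, an assumption of this sketch

t_1990 = read_time_seconds(1370, 4.4)
t_2010 = read_time_seconds(1 * MB_PER_TB, 100)
t_parallel = read_time_seconds(1 * MB_PER_TB, 100, drives=100)

print(f"1990: {t_1990 / 60:.1f} minutes")
print(f"2010: {t_2010 / 3600:.2f} hours")
print(f"2010, 100 drives: {t_parallel:.0f} seconds")
```

Under these assumptions a single 2010 drive needs 10,000 seconds (about 2.8 hours), while 100 drives in parallel need 100 seconds.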

Working in parallel
Problems
1 Hardware failure?
2 Combining data from different disks for analysis?

Solutions
1 HDFS: Hadoop Distributed Filesystem
2 MapReduce: programming model

Big Data Technology

Big Data Landscape


Hadoop

Hadoop is VMware, but the other way around.


Hadoop as the opposite of a virtual machine

VMware
1 take one physical server
2 split it up
3 get many small virtual servers

Hadoop
1 take many physical servers
2 merge them all together
3 get one big, massive, virtual server


Hadoop: core functionality

HDFS: Self-healing, high-bandwidth, clustered storage.
MapReduce: Distributed, fault-tolerant resource management,
    coupled with scalable data processing.

HDFS architecture


MapReduce
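
The MapReduce programming model splits a job into a map step that emits key/value pairs, a shuffle step that groups the values by key, and a reduce step that combines each group. The sketch below imitates the three steps in a single Python process for the classic word-count task. It illustrates the model only and does not use Hadoop's API; all function names are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data about data"])
print(counts)
```

On a real cluster the map calls run on the machines that hold the input blocks and the reduce calls run in parallel per key; here everything runs sequentially to keep the data flow visible.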


Hadoop: applications
Example Hadoop stack:

→ Hadoop distributions

Example Hadoop distributions


Hadoop vs RDBMS
Relational Database Management Systems (RDBMS):
    some queries → msecs
    other queries → hours, days
    Very fast to max speed!
Use when:
    latency is important
    ACID transactions (banking, ...)
    100% SQL compliance
Unstructured data → BLOB :-(

Hadoop:
    some queries → seconds, minutes
    other queries → seconds!!!
    Slower to (higher) max speed...
Use when:
    throughput important
    scalability of storage/compute
    (un|semi)structured data
    complex data processing (NoSQL, Java, C, Python, ...)

Pig, Hive

Apache Hadoop essentials: technology stack


Pig

MapReduce requires programmers who
    think in terms of map and reduce functions,
    more than likely use the Java language.

Pig provides a high-level language (Pig Latin) that can be used by
    Analysts
    Data Scientists
    Statisticians
    Etc.


Pig Latin

Originally from Yahoo!, created to allow analysts to access data.
Dataflow language.
Makes it simpler to write MapReduce programs.
Abstracts away specific details → focus on data processing.
Supports User Defined Functions (UDFs).
Compiles a script into a set of MapReduce jobs.


Pig example

Input data
    file with user data
    file with website data

Your task
    Find the top 5 most visited pages by users aged 18-25.

Dataflow
    Load users → Filter by age
    Load pages
    Join on name → Group on URL → Count clicks → Order by clicks → Take top 5

In MapReduce

. . . 170 lines of Java MapReduce code . . .


In Pig Latin

Example
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd   = join Fltrd by name, Pages by user;
Grpd  = group Jnd by url;
Smmd  = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd  = order Smmd by clicks desc;
Top5  = limit Srtd 5;
store Top5 into 'top5sites';

Only 9 lines of Pig Latin.
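
To make the individual steps concrete, here is the same dataflow as a minimal Python sketch over in-memory lists. The sample records are invented for illustration, the sketch assumes user names are unique, and it runs in one process rather than as MapReduce jobs.

```python
from collections import Counter

# Hypothetical sample records standing in for the 'users' and 'pages' files.
users = [("alice", 22), ("bob", 31), ("carol", 19)]
pages = [("alice", "a.com"), ("alice", "b.com"), ("bob", "a.com"),
         ("carol", "a.com"), ("carol", "c.com")]


def top_pages(users, pages, min_age=18, max_age=25, n=5):
    # Fltrd: keep only users in the age range.
    young = {name for name, age in users if min_age <= age <= max_age}
    # Jnd: join page visits to the filtered users on the user name.
    joined = [url for user, url in pages if user in young]
    # Grpd + Smmd: group by URL and count the clicks per group.
    clicks = Counter(joined)
    # Srtd + Top5: order by clicks (descending) and keep the first n.
    return clicks.most_common(n)


top5 = top_pages(users, pages)
print(top5)
```

Each comment names the Pig Latin alias that the line corresponds to, so the two versions can be read side by side.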


Hive

Originated at Facebook to analyze log data.
HiveQL: Hive Query Language, similar to standard SQL.
Queries are compiled into MapReduce jobs.
Has command-line shell, similar to e.g. MySQL shell.


Hive: example

Example (Create table to hold weather data)
CREATE TABLE records (year STRING,
                      temperature INT,
                      quality INT)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t';
Example (Populate Hive with the data)
LOAD DATA LOCAL INPATH 'input/sample.txt'
OVERWRITE INTO TABLE records;


Example (Run query)
hive> SELECT year, MAX(temperature)
    > FROM records
    > WHERE temperature != 9999
    > AND (quality = 0 OR quality = 1)
    > GROUP BY year;
1949    111
1950    22
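
Since HiveQL is close to standard SQL, the logic of this query can be tried out without a cluster. The sketch below runs it against an in-memory SQLite database from Python. SQLite is only a stand-in here, not Hive; the sample rows are invented for illustration, and an `ORDER BY` is added to make the row order deterministic.

```python
import sqlite3

# Hypothetical sample rows (year, temperature, quality); not the original input file.
ROWS = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1949", 111, 1),
    ("1949", 78, 1),
    ("1949", 9999, 1),  # excluded by the temperature != 9999 condition
    ("1950", 300, 4),   # excluded because quality is neither 0 nor 1
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (year TEXT, temperature INTEGER, quality INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, ?, ?)", ROWS)

result = conn.execute(
    """
    SELECT year, MAX(temperature)
    FROM records
    WHERE temperature != 9999
      AND (quality = 0 OR quality = 1)
    GROUP BY year
    ORDER BY year
    """
).fetchall()
conn.close()

for year, max_temp in result:
    print(year, max_temp)
```

The difference in Hive is in the execution: the same statement is compiled into MapReduce jobs that scan files in HDFS.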


NoSQL


RDBMS: Codd’s 12 rules

Codd’s 12 rules
A set of rules designed to define what is required from a database
management system in order for it to be considered relational.
Rule 0: The foundation rule
Rule 1: The information rule
Rule 2: The guaranteed access rule
Rule 3: Systematic treatment of null values
Rule 4: Active online catalog based on the relational model
...


ACID

A set of properties that guarantee that database transactions are
processed reliably.
Atomicity: A transaction is all or nothing.
Consistency: Only transactions that leave the data valid are accepted.
Isolation: Simultaneous transactions do not interfere with each other.
Durability: Written transaction data stays there "forever"
    (even in case of power loss, crashes, errors, ...).
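
Atomicity and consistency can be observed with Python's built-in `sqlite3` module. In this sketch a transfer that would violate a `CHECK` constraint is rolled back as a whole, including the part that had already succeeded. The table, the account names, and the `transfer` helper are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # would make alice's balance negative
print(ok, failed, balances(conn))
```

In the failed transfer the credit to `bob` is executed first and succeeds, yet it is undone when the debit violates the constraint, so no money appears out of nowhere.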


Scaling up
What if you need to scale up your RDBMS in terms of
    dataset size,
    read/write concurrency?
This usually involves
    breaking Codd's rules,
    loosening ACID restrictions,
    forgetting conventional DBA wisdom,
    losing most of the desirable properties that made RDBMS so
    convenient in the first place.
NoSQL to the rescue!

NoSQL

NoSQL
‘Invented’ by Carlo Strozzi in 1998 (for his file-based database)

“Not only SQL”
It’s NOT about
saying that SQL should never be used,
saying that SQL is dead.


NoSQL databases
Four emerging NoSQL categories: key-value stores, column family based
databases, document databases, and graph databases.


Key-Value stores or ‘the big hash table’

Keys    Values
13a1    Nexus 32 GB
13a2    Nexus 16 GB
13a3    Nexus 08 GB

Most basic type of NoSQL databases.
Aggregation of key-value pairs.
Typically only 4 operations:
    create(key, value)
    read(key)
    update(key, value)
    delete(key)
Fast, scalable, less complex.
Mainly used for systems with simple queries (caches etc.)
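
The four operations map directly onto a hash table. A minimal in-memory sketch in Python follows; the class name and the error handling are assumptions for illustration, and a real key-value store would add persistence and distribution over many machines.

```python
class KeyValueStore:
    """In-memory key-value store exposing only the four basic operations."""

    def __init__(self):
        self._data = {}

    def create(self, key, value):
        if key in self._data:
            raise KeyError(f"key already exists: {key!r}")
        self._data[key] = value

    def read(self, key):
        return self._data[key]

    def update(self, key, value):
        if key not in self._data:
            raise KeyError(f"no such key: {key!r}")
        self._data[key] = value

    def delete(self, key):
        del self._data[key]


store = KeyValueStore()
store.create("13a1", "Nexus 32 GB")
store.create("13a2", "Nexus 16 GB")
store.update("13a2", "Nexus 16 GB (refurbished)")
store.delete("13a1")
print(store.read("13a2"))
```

There is no way to ask "which keys hold a 16 GB device?" without scanning every value, which is why such stores suit simple lookups rather than ad hoc queries.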


Column-oriented DBMS
Example
Id    LastName    FirstName    Salary
10    Smith       Joe          40000
12    Jones       Mary         50000
11    Johnson     Cathy        44000
22    Jones       Bob          55000

Row-based:
10,Smith,Joe,40000;12,Jones,Mary,50000;11,Johnson,Cathy,44000;22,Jones,Bob,55000

Column-based:
10,12,11,22;Smith,Jones,Johnson,Jones;Joe,Mary,Cathy,Bob;40000,50000,44000,55000
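
Both layouts can be produced from the same table by serializing either the rows or the transposed table. A short Python sketch, with the helper name `serialize` chosen only for this example:

```python
# The example table as a list of rows: (Id, LastName, FirstName, Salary).
rows = [
    (10, "Smith", "Joe", 40000),
    (12, "Jones", "Mary", 50000),
    (11, "Johnson", "Cathy", 44000),
    (22, "Jones", "Bob", 55000),
]


def serialize(records):
    """Join the fields of each record with ',' and the records with ';'."""
    return ";".join(",".join(str(field) for field in record) for record in records)


row_based = serialize(rows)
# zip(*rows) transposes the table, so each record is now one column.
column_based = serialize(zip(*rows))

print(row_based)
print(column_based)

# An aggregate over one column only needs that column's values,
# which the column-based layout stores contiguously.
salaries = list(zip(*rows))[3]
print(sum(salaries))
```

This is the practical appeal of the column-based layout for analytics: a query such as a salary total reads one run of values instead of every full row.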


Column family based databases

Like column-oriented DBMS, but with a twist
Columns and supercolumns ≈ RDBMS table columns
Family of columns ≈ RDBMS table
Keyspace ≈ RDBMS database

Most complex NoSQL database type.
Based on Google’s BigTable paper.
More flexibility than traditional RDBMS:
adding (super)columns is always possible.
Excellent for analysis and mass treatment of data
(via Map-Reduce type operations)


Document databases
Data is stored as a collection of documents
(JSON, XML, ... but also PDF, Excel, ...)
Documents → collection of key-value pairs
Values can be
    simple values
    arrays
    another document (collection of key-values)
Schemaless
Quite well queryable
Example (Document 1)
{
  FirstName: "Bob",
  Address: "5 Oak St.",
  Hobby: "sailing"
}

Example (Document 2)
{
  FirstName: "Jonathan",
  Address: "15 Wanamassa Road",
  Children: [
    {Name: "Michael", Age: 10},
    {Name: "Jennifer", Age: 8},
    {Name: "Samantha", Age: 5},
    {Name: "Elena", Age: 2}
  ]
}

Best suited for custom queries like the ones in RDBMS.
Quite popular for Content Management Systems.
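
The two example documents have different fields, yet both can live in the same collection and be queried on fields that only some of them have. The sketch below models this with plain Python dicts; the `find` helper and the predicates are assumptions for illustration and do not correspond to any particular document database's API.

```python
# The two example documents as plain Python dicts; no shared schema is required.
documents = [
    {"FirstName": "Bob", "Address": "5 Oak St.", "Hobby": "sailing"},
    {
        "FirstName": "Jonathan",
        "Address": "15 Wanamassa Road",
        "Children": [
            {"Name": "Michael", "Age": 10},
            {"Name": "Jennifer", "Age": 8},
            {"Name": "Samantha", "Age": 5},
            {"Name": "Elena", "Age": 2},
        ],
    },
]


def find(docs, predicate):
    """Return every document for which predicate(doc) is true."""
    return [doc for doc in docs if predicate(doc)]


def has_child_younger_than(age):
    # Documents without a "Children" field simply do not match.
    return lambda doc: any(child["Age"] < age for child in doc.get("Children", []))


sailors = find(documents, lambda doc: doc.get("Hobby") == "sailing")
parents = find(documents, has_child_younger_than(5))
print([d["FirstName"] for d in sailors], [d["FirstName"] for d in parents])
```

A missing field is handled by the query rather than by the schema, which is the trade-off behind "schemaless".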


Document databases: examples


Graph databases
[Figure: example graph with the nodes Julie, Steve, Bob, Jim, Fido, Rock Music, BMW, and IBM, connected by edges labeled "Sister in-Law To", "Married To", "Listens To", "Brother Of", "Colleague Of", "Has Pet", "Drives", and "Works For".]


Based on graph theory.
Employ nodes (objects) and edges (relations between objects).
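
A minimal way to represent nodes and labeled edges is an adjacency list. The Python sketch below is an assumption-laden illustration: the `Graph` class is made up for this example, and the specific edges are hypothetical rather than read off the figure.

```python
from collections import defaultdict


class Graph:
    """Directed graph with labeled edges, stored as adjacency lists."""

    def __init__(self):
        self._edges = defaultdict(list)  # node -> [(label, target), ...]

    def add_edge(self, source, label, target):
        self._edges[source].append((label, target))

    def neighbors(self, source, label=None):
        """Targets reachable from source, optionally restricted to one edge label."""
        return [
            target
            for edge_label, target in self._edges[source]
            if label is None or edge_label == label
        ]


# Hypothetical edges, chosen for illustration only.
g = Graph()
g.add_edge("Bob", "Works For", "IBM")
g.add_edge("Jim", "Works For", "IBM")
g.add_edge("Bob", "Has Pet", "Fido")
g.add_edge("Jim", "Drives", "BMW")

employees = [n for n in ("Bob", "Jim") if "IBM" in g.neighbors(n, "Works For")]
print(employees, g.neighbors("Bob"))
```

Following an edge is a direct lookup from the node itself, whereas a relational design would typically need a join per hop.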

Graph databases: examples

Well-suited for problems with network-structure:
mine data from social media
“customers who bought this also looked at ...”
relations between persons
...


Use the right tool for the right job!

http://db-engines.com/
Big Data in my company?

Typical RDBMS scaling story
1. Initial Public Launch
From local workstation → remotely hosted MySQL instance.
2. Service popularity ↑, too many reads hitting the database
Add memcached to cache common queries. Reads are now no
longer strictly ACID; cached data must expire.
3. Popularity ↑↑, too many writes hitting the database
Scale MySQL vertically by buying a beefed-up (costly) server:
    16 cores
    128 GB of RAM
    banks of 15k RPM hard drives

4. New features → query complexity ↑, now too many joins
Denormalize your data to reduce joins.
(That's not what they taught me in DBA school!)
5. Rising popularity swamps the server; things are too slow
Stop doing any server-side computations.

6. Some queries are still too slow
Periodically prematerialize the most complex queries, and try to
stop joining in most cases.
7. Reads are OK, writes are getting slower and slower. . .
Drop secondary indexes and triggers (no indexes?).
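
Step 2 of this story, caching common queries with an expiry time, can be sketched in a few lines of Python. This is an in-process stand-in rather than memcached itself; the `TTLCache` class, the fake clock, and the query function are all assumptions made for the example.

```python
import time


class TTLCache:
    """Cache query results for a fixed time-to-live, after which they expire."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expiry_time, value)

    def get(self, key, load):
        """Return the cached value for key, calling load() on a miss or after expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # hit: may be stale relative to the database
        value = load()       # miss or expired: go to the database
        self._entries[key] = (now + self._ttl, value)
        return value


# Usage with a fake clock and a counter standing in for the database.
fake_now = [0.0]
db_hits = []


def slow_query():
    db_hits.append(1)
    return "result"


cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
cache.get("top_users", slow_query)  # miss, hits the database
cache.get("top_users", slow_query)  # hit, served from the cache
fake_now[0] = 61.0
cache.get("top_users", slow_query)  # expired, hits the database again
print(len(db_hits))
```

Between a write to the database and the expiry of the cached entry, readers see the old value, which is why the reads are no longer strictly ACID.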
If you stay up at night
worrying about your database
(uptime, scale, or speed), you
should seriously consider
making a jump from the
RDBMS world to HBase.

Two types of companies (personal view)

‘Core Big Data’ company
Core business = big data processing, crunching, analyzing, ...
Example
Google, Facebook, ...
Smart metering companies
Video/Image processing companies
Biotech companies with sequencing data
...


‘General Big Data’ company
Some other core business.
Lots of useful data is available.
Desirable: business analytics, process optimization, ...
Example
Supermarkets → customer cards
Transport firms → GPS-traces
...


Use-cases of Big Data

‘Core Big Data’ company
Big Data
    crunching,
    hacking,
    processing,
    analyzing,
    ...

‘General Big Data’ company
Business Analytics
    improve decision-making,
    gain operational insights,
    increase overall performance,
    track and analyze shopping patterns,
    ...

Both
Explore! Discover hidden gems!

Some examples

Intrusion detection based on server log data
Real-time security analytics
Fraud detection

Customer behavior based sentiment analysis of social media
Campaign analytics

How to predict wine quality?
    Skip tasting! Use science!
    Weather seems the key variable.
    Correlate historical weather & wine data.
Reduce fuel cost and improve driver safety by analyzing geolocation data


Big Data in your company
Big data is typically a division of the IT-department.
Requires skilled people:
sysadmins
software developers
data-scientists
visualization experts
...

Advice, trend (Andrew McAfee)
Give geeks a seat at the decision-making table.


IWT TETRA project

Our current mission
IWT TETRA project
Submission deadline: March 12, 2014
Three pillars
New to Big Data Tech? → Explain, Advise and Help
Already using Big Data Tech? → Benchmark and Tune
Got Data? → Analyse and Visualize
Interested? → Come talk to us!


Conclusions

“Big” can be small too.
The Big Data landscape is huge.
RDBMS and SQL are not dead.
The right tool for the right job!
Your company can benefit from Big Data technology.
We can help.
Be brave in your quest...

Questions?

johan@sizingservers.be
bart@sizingservers.be
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 

Kürzlich hochgeladen (20)

Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 

Big Data: hype or necessity?

  • 1. Big Data Big Data: hype or necessity? Dr. ir. ing. Bart Vandewoestyne Sizing Servers Lab, Howest, Kortrijk February 18, 2014 1 / 71
  • 2. Big Data Outline 1 Introduction Big Data? 2 Big Data Technology Hadoop Pig, Hive NoSQL 3 Big Data in my company? 4 Conclusions 2 / 71
  • 3. Big Data Introduction Outline 1 Introduction Big Data? 2 Big Data Technology Hadoop Pig, Hive NoSQL 3 Big Data in my company? 4 Conclusions 3 / 71
  • 4. Big Data Introduction Big Data? Exponential growth of data [Figure: "Big Data: This is just the beginning". Axes: volume in exabytes and percent of uncertain data, 2010 to 2015. Series: Enterprise Data, Sensors & Devices, VoIP, Social Media. Marker: "You are here". © 2013 International Business Machines Corporation] 4 / 71
  • 5. Big Data Introduction Big Data? Examples Facebook hosts ≈ 10 billion photos ≈ 1 petabyte Large Hadron Collider: will produce ≈ 15 petabytes per year 5 / 71
  • 6. Big Data Introduction Big Data? Examples RFID readers vehicle GPS traces Smart energy meters 6 / 71
  • 7. Big Data Introduction Big Data? Examples from Flanders Myriade Be-Mobile / Touring Mobilis Colruyt avoid traffic jams find optimal planning 7 / 71
  • 8. Big Data Introduction Big Data? Big Data definition Definition of Big Data depends on who you ask: Big Data “Multiple terabytes or petabytes.” (according to some professionals) “I don’t know.” (today’s big may be tomorrow’s normal) “Relative to its context.” 8 / 71
  • 9. Big Data Introduction Big Data? Quotes on Big Data “Big data” is a subjective label attached to situations in which human and technical infrastructures are unable to keep pace with a company’s data needs. It’s about recognizing that for some problems other storage solutions are better suited. 9 / 71
  • 10. Big Data Introduction Big Data? The Three V’s Volume The amount of data is big. Variety Different kinds of data: structured semi-structured unstructured Velocity Speed-issues to consider: How fast is the data available for analysis? How fast can we do something with it? Other V’s: Veracity, Variability, Validity, Value,. . . 10 / 71
  • 11. Big Data Introduction Big Data? Structured data Structured data Pre-defined schema imposed on the data Highly structured Usually stored in a relational database system Example numbers: 20, 3.1415,. . . strings: ”Hello World” dates: 21/03/1978 ... Roughly 20% of all data out there is structured. 11 / 71
  • 12. Big Data Introduction Big Data? Semi-structured data Semi-structured data Inconsistent structure. Cannot be stored in rows and tables in a typical database. Information is often self-describing (label/value pairs). Example XML, SGML,. . . tweets BibTeX files sensor feeds logs ... 12 / 71
  • 13. Big Data Introduction Big Data? Semi-structured data: examples Example <?xml version="1.0"?> <catalog> <book id="bk101"> <author>Gambardella, Matthew</author> <title>XML Developer’s Guide</title> <genre>Computer</genre> <price>44.95</price> </book> </catalog> 13 / 71
  • 14. Big Data Introduction Big Data? Semi-structured data: examples Example @article{vandewoestyne2007cools_b, author = {Vandewoestyne, Bart and Cools, Ronald}, title = {On obtaining higher order convergence for smooth periodic functions}, journal = {Journal of Complexity}, year = {2008}, volume = {24}, number = {3}, pages = {328--340}, month = jun } 14 / 71
  • 15. Big Data Introduction Big Data? Unstructured data Definition (Unstructured data) Lacks structure or parts of it lack structure. Example multimedia: videos, photos, audio files,. . . word processing documents email messages reports free-form text ... presentations Experts estimate that 80 to 90 % of the data in any organization is unstructured. 15 / 71
  • 16. Big Data Introduction Big Data? Data Storage and Analysis Storage capacity of hard drives has increased massively over the years. Access speeds have not kept up. Example (Reading a whole disk) 1990: storage capacity 1370 MB, transfer speed 4.4 MB/s, time ≈ 5 minutes. 2010: storage capacity 1 TB, transfer speed 100 MB/s, time > 2.5 hours. Solution: work in parallel! Using 100 drives (each holding 1/100th of the data), reading 1 TB takes less than 2 minutes. 16 / 71
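The figures on this slide can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1 TB = 1,000,000 MB) and perfectly parallel reads with no overhead, and `read_time_seconds` is a name made up for this illustration.

```python
def read_time_seconds(capacity_mb: float, speed_mb_per_s: float, drives: int = 1) -> float:
    """Time to read all data when it is split evenly over `drives` disks."""
    return capacity_mb / drives / speed_mb_per_s

# Assumption: decimal units, so 1 TB = 1_000_000 MB.
t_1990 = read_time_seconds(1370, 4.4)
t_2010 = read_time_seconds(1_000_000, 100)
t_parallel = read_time_seconds(1_000_000, 100, drives=100)

print(f"1990: {t_1990 / 60:.1f} minutes")
print(f"2010: {t_2010 / 3600:.2f} hours")
print(f"2010, 100 drives: {t_parallel / 60:.2f} minutes")
```

Under these assumptions the single 2010 drive needs 10,000 seconds (about 2.8 hours), while 100 drives need 100 seconds, which is consistent with the "less than 2 minutes" claim.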
  • 17. Big Data Introduction Big Data? Working in parallel Problems 1 Hardware failure? 2 Combining data from different disks for analysis? Solutions 1 HDFS: Hadoop Distributed Filesystem 2 MapReduce: programming model 17 / 71
  • 18. Big Data Big Data Technology Outline 1 Introduction Big Data? 2 Big Data Technology Hadoop Pig, Hive NoSQL 3 Big Data in my company? 4 Conclusions 18 / 71
  • 19. Big Data Big Data Technology Big Data Landscape 19 / 71
  • 20. Big Data Big Data Technology Big Data Landscape 20 / 71
  • 21. Big Data Big Data Technology Hadoop Hadoop Hadoop is VMware, but the other way around. 21 / 71
  • 22. Big Data Big Data Technology Hadoop Hadoop as the opposite of a virtual machine VMware: 1. take one physical server, 2. split it up, 3. get many small virtual servers. Hadoop: 1. take many physical servers, 2. merge them all together, 3. get one big, massive, virtual server. 22 / 71
  • 23. Big Data Big Data Technology Hadoop Hadoop: core functionality HDFS Self-healing, high-bandwidth, clustered storage. MapReduce Distributed, fault-tolerant resource management, coupled with scalable data processing. 23 / 71
  • 24. Big Data Big Data Technology Hadoop HDFS architecture 24 / 71
  • 25. Big Data Big Data Technology Hadoop MapReduce 25 / 71
  • 26. Big Data Big Data Technology Hadoop MapReduce 26 / 71
  • 27. Big Data Big Data Technology Hadoop Hadoop: applications Example Hadoop stack: → Hadoop distributions 27 / 71
  • 28. Big Data Big Data Technology Hadoop Example Hadoop distributions 28 / 71
  • 29. Big Data Big Data Technology Hadoop Hadoop vs RDBMS Relational Database Management Systems (RDBMS): Very fast to max speed! Some queries → msecs, other queries → hours, days. Use when: latency is important; ACID transactions (banking, ...); 100% SQL compliance. Unstructured data → BLOB :-( 29 / 71
  • 30. Big Data Big Data Technology Hadoop Hadoop vs RDBMS Hadoop: Slower to (higher) max speed... Some queries → seconds, minutes, other queries → seconds!!! Use when: throughput important; scalability of storage/compute; (un|semi)structured data; complex data processing (NoSQL, Java, C, Python, ...) 30 / 71
  • 31. Big Data Big Data Technology Pig, Hive Apache Hadoop essentials: technology stack 31 / 71
  • 32. Big Data Big Data Technology Pig, Hive Pig MapReduce requires programmers to think in terms of map and reduce functions, and more than likely to use the Java language. Pig provides a high-level language (Pig Latin) that can be used by Analysts, Data Scientists, Statisticians, etc. 32 / 71
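To make "thinking in terms of map and reduce functions" concrete, here is a minimal single-process sketch of the programming model in Python, using the classic word-count task. It is not Hadoop's API; the function names `map_words`, `shuffle`, `reduce_counts` and `word_count` are invented for this illustration, and a real framework would run the map and reduce steps on many machines.

```python
from collections import defaultdict
from typing import Iterable, Iterator

def map_words(line: str) -> Iterator[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group values by key; a framework does this between map and reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(word: str, counts: list[int]) -> tuple[str, int]:
    """Reduce step: combine all values that were emitted for one key."""
    return word, sum(counts)

def word_count(lines: Iterable[str]) -> dict[str, int]:
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(word, counts) for word, counts in shuffle(pairs).items())

result = word_count(["big data is big", "data is data"])
print(result)
```

Even this toy task needs three cooperating functions, which hints at why a higher-level language is attractive for analysts.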
  • 33. Big Data Big Data Technology Pig, Hive Pig Latin Pig Latin Originally from Yahoo! to allow analysts to access data. Dataflow language. Makes it simpler to write MapReduce programs. Abstracts you from specific details → focus on data processing. Has User Defined Functions (UDFs). Compiles script into a set of MapReduce jobs. 33 / 71
  • 34. Big Data Big Data Technology Pig, Hive Pig example Input data: file with user data, file with website data. Your task: find the top 5 most visited pages by users aged 18-25. Dataflow: load users → filter by age; load pages; join on name → group on URL → count clicks → order by clicks → take top 5. 34 / 71
  • 35. Big Data Big Data Technology Pig, Hive In MapReduce . . . 170 lines of Java MapReduce code . . . 35 / 71
  • 36. Big Data Big Data Technology Pig, Hive In Pig Latin Example Users = load 'users' as (name, age); Fltrd = filter Users by age >= 18 and age <= 25; Pages = load 'pages' as (user, url); Jnd = join Fltrd by name, Pages by user; Grpd = group Jnd by url; Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks; Srtd = order Smmd by clicks desc; Top5 = limit Srtd 5; store Top5 into 'top5sites'; Only 9 lines of Pig Latin. 36 / 71
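For readers who do not know Pig Latin, the same dataflow can be sketched in plain Python on a tiny in-memory data set. The sample users and pages are invented for this illustration, `top_pages` is a hypothetical helper, and the sketch assumes user names are unique; it shows the logic only, not how Pig executes it.

```python
from collections import Counter

# Hypothetical sample data, invented for this sketch.
users = [("alice", 22), ("bob", 31), ("carol", 19), ("dave", 25)]
pages = [
    ("alice", "a.example"), ("alice", "b.example"), ("bob", "a.example"),
    ("carol", "a.example"), ("carol", "c.example"), ("dave", "b.example"),
]

def top_pages(users, pages, lo=18, hi=25, n=5):
    # filter Users by age >= 18 and age <= 25
    young = {name for name, age in users if lo <= age <= hi}
    # join Fltrd by name, Pages by user; group by url; COUNT as clicks
    clicks = Counter(url for user, url in pages if user in young)
    # order by clicks desc; limit n
    return clicks.most_common(n)

top5 = top_pages(users, pages)
print(top5)
```

Each comment names the Pig Latin statement that the Python line stands in for, so the two versions can be read side by side.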
  • 37. Big Data Big Data Technology Pig, Hive Hive Originated at Facebook to analyze log data. HiveQL: Hive Query Language, similar to standard SQL. Queries are compiled into MapReduce jobs. Has command-line shell, similar to e.g. MySQL shell. 37 / 71
  • 38. Big Data Big Data Technology Pig, Hive Hive: example Example (Create table to hold weather data) CREATE TABLE records (year STRING, temperature INT, quality INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; Example (Populate Hive with the data) LOAD DATA LOCAL INPATH 'input/sample.txt' OVERWRITE INTO TABLE records; 38 / 71
  • 39. Big Data Big Data Technology Pig, Hive Hive: example Example (Run query) hive> SELECT year, MAX(temperature) FROM records WHERE temperature != 9999 AND (quality = 0 OR quality = 1) GROUP BY year; Output: 1949 → 111, 1950 → 22. 39 / 71
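The query above is ordinary SQL, so its logic can be tried out without a Hadoop cluster. The sketch below runs an equivalent query against an in-memory SQLite database from Python. SQLite is swapped in for Hive purely for illustration, and the sample rows are invented so that the result matches the output on the slide; 9999 is treated here as a sentinel value that the query filters out.

```python
import sqlite3

# Hypothetical sample rows (year, temperature, quality), invented so that
# the result matches the output shown on the slide.
rows = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1949", 111, 1),
    ("1949", 78, 1),
    ("1949", 9999, 1),  # sentinel value, removed by the WHERE clause
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (year TEXT, temperature INTEGER, quality INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)

result = conn.execute(
    """
    SELECT year, MAX(temperature)
    FROM records
    WHERE temperature != 9999 AND (quality = 0 OR quality = 1)
    GROUP BY year
    ORDER BY year
    """
).fetchall()
conn.close()
print(result)
```

The difference with Hive lies in execution rather than syntax: Hive compiles such a query into MapReduce jobs that run over files in the cluster.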
  • 40. Big Data Big Data Technology NoSQL NoSQL 40 / 71
  • 41. Big Data Big Data Technology NoSQL RDBMS: Codd’s 12 rules Codd’s 12 rules A set of rules designed to define what is required from a database management system in order for it to be considered relational. Rule 0 The Foundation rule Rule 1 The Information rule Rule 2 The guaranteed access rule Rule 3 Systematic treatment of null values Rule 4 Active online catalog based on the relational model ... ... 41 / 71
  • 42. Big Data Big Data Technology NoSQL ACID ACID A set of properties that guarantee that database transactions are processed reliably. Atomicity A transaction is all or nothing. Consistency Only transactions with valid data. Isolation Simultaneous transactions will not interfere. Durability Written transaction data stays there “forever” (even in case of power loss, crashes, errors,. . . ). 42 / 71
  • 43. Big Data Big Data Technology NoSQL Scaling up What if you need to scale up your RDBMS in terms of dataset size, read/write concurrency? This usually involves breaking Codd's rules, loosening ACID restrictions, forgetting conventional DBA wisdom, and losing most of the desirable properties that made RDBMS so convenient in the first place. NoSQL to the rescue! 43 / 71
  • 44. Big Data Big Data Technology NoSQL NoSQL NoSQL 'Invented' by Carlo Strozzi in 1998 (for his file-based database) "Not only SQL" It's NOT about saying that SQL should never be used, or saying that SQL is dead. 44 / 71
  • 45. Big Data Big Data Technology NoSQL NoSQL databases Four emerging NoSQL categories: 45 / 71
  • 46. Big Data Big Data Technology NoSQL Key-Value stores or 'the big hash table' Keys → Values: 13a1 → Nexus 32 GB; 13a2 → Nexus 16 GB; 13a3 → Nexus 08 GB. Most basic type of NoSQL databases. Aggregation of key-value pairs. Typically only 4 operations: create(key, value), read(key), update(key, value), delete(key). Fast, scalable, less complex. Mainly used for systems with simple queries (caches etc.) 46 / 71
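The four operations map directly onto a hash table. The following Python class is a minimal in-memory stand-in, not the API of any particular key-value product; the class name `KeyValueStore` and its error behaviour are assumptions made for this sketch.

```python
class KeyValueStore:
    """In-memory stand-in for a key-value store with the four basic operations."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def create(self, key: str, value: str) -> None:
        if key in self._data:
            raise KeyError(f"key already exists: {key}")
        self._data[key] = value

    def read(self, key: str) -> str:
        return self._data[key]

    def update(self, key: str, value: str) -> None:
        if key not in self._data:
            raise KeyError(f"no such key: {key}")
        self._data[key] = value

    def delete(self, key: str) -> None:
        del self._data[key]

store = KeyValueStore()
store.create("13a1", "Nexus 32 GB")
store.create("13a2", "Nexus 16 GB")
store.update("13a2", "Nexus 16 GB (refurbished)")
store.delete("13a1")
print(store.read("13a2"))
```

Note what is missing: there is no way to ask "which keys hold a 16 GB device?" without scanning everything, which is why such stores suit simple lookups.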
  • 47. Big Data Big Data Technology NoSQL Key-Value stores or ’the big hash table’ 47 / 71
  • 48. Big Data Big Data Technology NoSQL Column-oriented DBMS Example Table with columns (Id, LastName, FirstName, Salary) and rows (10, Smith, Joe, 40000), (12, Jones, Mary, 50000), (11, Johnson, Cathy, 44000), (22, Jones, Bob, 55000). Row-based: 10,Smith,Joe,40000;12,Jones,Mary,50000;11,Johnson,Cathy,44000;22,Jones,Bob,55000 Column-based: 10,12,11,22;Smith,Jones,Johnson,Jones;Joe,Mary,Cathy,Bob;40000,50000,44000,55000 48 / 71
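The two layouts differ only in the order in which the same values are written out. A short Python sketch, using the table from the slide, shows that the column-based form is the transpose of the row-based form; `serialize` is a helper name invented for this illustration.

```python
columns = ["Id", "LastName", "FirstName", "Salary"]
rows = [
    (10, "Smith", "Joe", 40000),
    (12, "Jones", "Mary", 50000),
    (11, "Johnson", "Cathy", 44000),
    (22, "Jones", "Bob", 55000),
]

def serialize(records) -> str:
    """Join fields with ',' and records with ';'."""
    return ";".join(",".join(str(field) for field in record) for record in records)

row_based = serialize(rows)
# zip(*rows) transposes the table: one tuple per column instead of per row.
column_based = serialize(zip(*rows))

# With a column layout, one attribute can be aggregated without reading the others.
salaries = list(zip(*rows))[columns.index("Salary")]
total_salary = sum(salaries)

print(row_based)
print(column_based)
print(total_salary)
```

The last step illustrates the usual motivation for column stores: an aggregate over one attribute reads one contiguous run of values.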
  • 49. Big Data Big Data Technology NoSQL Column family based databases Like column-oriented DBMS, but with a twist Columns and supercolumns ≈ RDBMS table columns Family of columns ≈ RDBMS table Keyspace ≈ RDBMS database 49 / 71
  • 50. Big Data Big Data Technology NoSQL Column family based databases Most complex NoSQL database type. Based on Google’s BigTable paper. More flexibility than traditional RDBMS: adding (super)columns is always possible. Excellent for analysis and mass treatment of data (via Map-Reduce type operations) 50 / 71
  • 51. Big Data Big Data Technology NoSQL Document databases Data is stored as a collection of documents (JSON, XML,. . . but also PDF, Excel,. . . ) Documents → collection of key-value pairs Values can be simple values arrays another document (collection of key-values) Schemaless Quite well queryable 51 / 71
  • 52. Big Data Big Data Technology NoSQL Document databases Example (Document 1) { FirstName: "Bob", Address: "5 Oak St.", Hobby: "sailing" } Example (Document 2) { FirstName: "Jonathan", Address: "15 Wanamassa Road", Children: [ {Name: "Michael", Age: 10}, {Name: "Jennifer", Age: 8}, {Name: "Samantha", Age: 5}, {Name: "Elena", Age: 2} ] } Best suited for custom queries like the ones in RDBMS. Quite popular for Content Management Systems. 52 / 71
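The two documents have different fields, which is what "schemaless" means in practice. The Python sketch below stores them as plain dictionaries and runs a simple query over a nested array; it is not the query API of any real document database, and `parents_with_child_younger_than` is a hypothetical helper.

```python
documents = [
    {"FirstName": "Bob", "Address": "5 Oak St.", "Hobby": "sailing"},
    {
        "FirstName": "Jonathan",
        "Address": "15 Wanamassa Road",
        "Children": [
            {"Name": "Michael", "Age": 10},
            {"Name": "Jennifer", "Age": 8},
            {"Name": "Samantha", "Age": 5},
            {"Name": "Elena", "Age": 2},
        ],
    },
]

def parents_with_child_younger_than(docs, age):
    """Return FirstName of every document that has a child below `age`."""
    return [
        doc["FirstName"]
        for doc in docs
        # .get() copes with documents that have no "Children" field at all.
        if any(child["Age"] < age for child in doc.get("Children", []))
    ]

younger_than_3 = parents_with_child_younger_than(documents, 3)
print(younger_than_3)
```

The query has to tolerate missing fields itself, since no schema guarantees that every document has a `Children` entry.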
  • 53. Big Data Big Data Technology NoSQL Document databases: examples 53 / 71
  • 54. Big Data Big Data Technology NoSQL Graph databases [Figure: example graph with nodes Julie, Steve, Bob, Jim, Fido, Rock Music, BMW and IBM, and edges labelled Sister-in-Law To, Married To, Brother Of, Colleague Of, Has Pet, Listens To, Drives and Works For] Based on graph theory. Employ nodes (objects) and edges (relations between objects). 54 / 71
  • 55. Big Data Big Data Technology NoSQL Graph databases: examples Well-suited for problems with network-structure: mine data from social media “customers who bought this also looked at. . . ” relations between persons ... 55 / 71
  • 56. Big Data Big Data Technology NoSQL Use the right tool for the right job! http://db-engines.com/ 56 / 71
  • 57. Big Data Big Data in my company? Outline 1 Introduction Big Data? 2 Big Data Technology Hadoop Pig, Hive NoSQL 3 Big Data in my company? 4 Conclusions 57 / 71
  • 58. Big Data Big Data in my company? Typical RDBMS scaling story 1. Initial Public Launch From local workstation → remotely hosted MySQL instance. 2. Service popularity ↑, too many reads hitting the database Add memcached to cache common queries. Reads are now no longer strictly ACID; cached data must expire. 3. Popularity ↑↑, too many writes hitting the database Scale MySQL vertically by buying a beefed-up server (costly): 16 cores, 128 GB of RAM, banks of 15k RPM hard drives. 58 / 71
  • 59. Big Data Big Data in my company? Typical RDBMS scaling story 4. New features → query complexity ↑, now too many joins Denormalize your data to reduce joins. (That's not what they taught me in DBA school!) 5. Rising popularity swamps the server; things are too slow Stop doing any server-side computations. 59 / 71
  • 60. Big Data Big Data in my company? Typical RDBMS scaling story 6. Some queries are still too slow Periodically prematerialize the most complex queries, and try to stop joining in most cases. 7. Reads are OK, writes are getting slower and slower. . . Drop secondary indexes and triggers (no indexes?). If you stay up at night worrying about your database (uptime, scale, or speed), you should seriously consider making a jump from the RDBMS world to HBase. 60 / 71
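Step 2 of the scaling story ("cached data must expire") can be illustrated with a small read-through cache. This Python sketch is not memcached or its client API; `TTLCache`, the injected `clock` parameter and the `run_query` stand-in for the database are all assumptions made for this example.

```python
import time
from typing import Any, Callable

class TTLCache:
    """Read-through cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, Any]] = {}

    def get(self, key: str, load: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]      # fresh hit: the database is not touched
        value = load()           # miss or expired: run the real query
        self._entries[key] = (now, value)
        return value

# Usage with a fake clock and a counter standing in for the database.
fake_now = [0.0]
db_calls = []

def run_query():
    db_calls.append(1)
    return "result"

cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
cache.get("top_products", run_query)   # miss: hits the database
cache.get("top_products", run_query)   # hit: served from the cache
fake_now[0] = 61.0
cache.get("top_products", run_query)   # expired: hits the database again
print(len(db_calls))
```

Between the second and third call a reader may see data that is up to 60 seconds old, which is the sense in which cached reads are no longer strictly ACID.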
  • 61. Big Data Big Data in my company? Two types of companies (personal view) ‘Core Big Data’ company Core business = big data processing, crunching, analyzing,. . . Example Google, Facebook,. . . Smart metering companies Video/Image processing companies Biotech companies with sequencing data ... 61 / 71
  • 62. Big Data Big Data in my company? Two types of companies (personal view) ‘General Big Data’ company Some other core business. Lots of useful data is available. Desirable: business analytics, process optimization,. . . Example Supermarkets → customer cards Transport firms → GPS-traces ... 62 / 71
  • 63. Big Data Big Data in my company? Use-cases of Big Data 'Core Big Data' company: Big Data crunching, hacking, processing, analyzing, ... 'General Big Data' company: Business Analytics: improve decision-making, gain operational insights, increase overall performance, track and analyze shopping patterns, ... Both: Explore! Discover hidden gems! 63 / 71
  • 64. Big Data Big Data in my company? Some examples Intrusion detection based on server log data Real-time security analytics Fraud detection Customer behavior based sentiment analysis of social media Campaign analytics 64 / 71
  • 65. Big Data Big Data in my company? Some examples How to predict wine quality? Skip tasting! Use science! Weather seems the key variable. Correlate historical weather & wine data. Reduce fuel cost and improve driver safety by analyzing geolocation data 65 / 71
  • 66. Big Data Big Data in my company? Big Data in your company Big data is typically a division of the IT-department. Requires skilled people: sysadmins software developers data-scientists visualization experts ... Advice, trend (Andrew McAfee) Give geeks a seat at the decision-making table. 66 / 71
  • 67. Big Data Big Data in my company? Big Data in your company 67 / 71
  • 68. Big Data Big Data in my company? IWT TETRA project Our current mission IWT TETRA project Submission deadline: March 12, 2014 Three pillars New to Big Data Tech? → Explain, Advise and Help Already using Big Data Tech? → Benchmark and Tune Got Data? → Analyse and Visualize Interested? → Come talk to us! 68 / 71
  • 69. Big Data Conclusions Outline 1 Introduction Big Data? 2 Big Data Technology Hadoop Pig, Hive NoSQL 3 Big Data in my company? 4 Conclusions 69 / 71
  • 70. Big Data Conclusions Conclusions “Big” can be small too. The Big Data landscape is huge. RDBMS and SQL are not dead. The right tool for the right job! Your company can benefit from Big Data technology. We can help. Be brave in your quest. . . 70 / 71