Big Data: an introduction

Dr. ir. ing. Bart Vandewoestyne
Sizing Servers Lab, Howest, Kortrijk
March 28, 2014

Outline
1 Introduction: Big Data?
2 Big Data Technology
3 Big Data in my company?
4 IWT TETRA project
5 Conclusions

Introduction: Big Data?

Exponential growth of data
[Figure: “Big Data: This is just the beginning” (© 2013 International Business Machines Corporation). Data volume in exabytes (axis up to 9000) and percentage of uncertain data (0 to 100), 2010 to 2015, with a “You are here” marker. Data sources shown: Sensors & Devices, VoIP, Enterprise Data, Social Media.]

Big Data definition
The definition of Big Data depends on who you ask:
“Multiple terabytes or petabytes.” (according to some professionals)
“I don’t know.” (today’s big may be tomorrow’s normal)
“Relative to its context.”

Quotes on Big Data
“Big data” is a subjective label attached to situations in
which human and technical infrastructures are unable to
keep pace with a company’s data needs.
It’s about recognizing that for some problems other
storage solutions are better suited.

The Three V’s
Volume: The amount of data is big.
Variety: Different kinds of data:
structured
semi-structured
unstructured
Velocity: Speed issues to consider:
How fast is the data available for analysis?
How fast can we do something with it?
Other V’s: Veracity, Variability, Validity, Value, ...

Structured data
Pre-defined schema imposed on the data
Highly structured
Usually stored in a relational database system
Example
numbers: 20, 3.1415, ...
dates: 21/03/1978
strings: "Hello World"
...
Roughly 20% of all data out there is structured.

Semi-structured data
Inconsistent structure.
Cannot be stored in rows and tables in a typical database.
Information is often self-describing (label/value pairs).
Example
XML, SGML, ...
BibTeX files
logs
tweets
sensor feeds
...
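To make "self-describing label/value pairs" concrete, here is a minimal Python sketch. It assumes the records arrive as JSON lines, which is only one possible encoding; the record contents and the helper names parse_records and all_labels are hypothetical and chosen for illustration.

```python
import json

# Hypothetical records: each one is self-describing (label/value pairs),
# but the set of labels differs from record to record.
RAW_RECORDS = [
    '{"sensor": "t1", "temp_c": 21.5}',
    '{"sensor": "t2", "temp_c": 19.0, "battery": 0.82}',
    '{"user": "alice", "text": "hello", "tags": ["bigdata"]}',
]


def parse_records(lines):
    """Parse each JSON line into a dict of label/value pairs."""
    return [json.loads(line) for line in lines]


def all_labels(records):
    """Collect every label that appears in at least one record."""
    labels = set()
    for record in records:
        labels.update(record)
    return sorted(labels)


records = parse_records(RAW_RECORDS)
print(all_labels(records))
# ['battery', 'sensor', 'tags', 'temp_c', 'text', 'user']
```

No single fixed set of columns fits all three records, which is why such data does not map cleanly onto one relational table.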

Unstructured data
Definition (Unstructured data)
Data that lacks structure, in whole or in part.
Example
multimedia: videos, photos, audio files, ...
email messages
free-form text
word processing documents
presentations
reports
...
Experts estimate that 80 to 90% of the data in any organization is unstructured.

Data Storage and Analysis
Storage capacity of hard drives has increased massively over the years.
Access speeds have not kept up.
Example (Reading a whole disk)
Year   Storage capacity   Transfer speed   Time
1990   1370 MB            4.4 MB/s         ≈ 5 minutes
2010   1 TB               100 MB/s         > 2.5 hours
Solution: work in parallel!
Using 100 drives (each holding 1/100th of the data), reading 1 TB takes less than 2 minutes.
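The read times above follow from dividing capacity by transfer speed. The short Python check below reproduces them; it assumes decimal units (1 TB = 1,000,000 MB) and that the data is spread evenly over the drives, and the function name read_time_seconds is made up for this sketch.

```python
def read_time_seconds(capacity_mb, speed_mb_per_s, drives=1):
    """Time to read the full capacity when it is split evenly over `drives`."""
    return capacity_mb / drives / speed_mb_per_s


MB_PER_TB = 1_000_000  # decimal terabyte: an assumption about the slide's units

t_1990 = read_time_seconds(1370, 4.4)
t_2010 = read_time_seconds(MB_PER_TB, 100)
t_parallel = read_time_seconds(MB_PER_TB, 100, drives=100)

print(f"1990: {t_1990 / 60:.1f} minutes")                   # 1990: 5.2 minutes
print(f"2010: {t_2010 / 3600:.2f} hours")                   # 2010: 2.78 hours
print(f"2010, 100 drives: {t_parallel / 60:.2f} minutes")   # 2010, 100 drives: 1.67 minutes
```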

Working in parallel
Problems
1 Hardware failure?
2 Combining data from different disks for analysis?
Solutions
1 HDFS: Hadoop Distributed Filesystem
2 MapReduce: programming model

Big Data Technology

Big Data Landscape

Hadoop
Hadoop is VMware, but the other way around.

Hadoop as the opposite of a virtual machine
VMware
1 take one physical server
2 split it up
3 get many small virtual servers
Hadoop
1 take many physical servers
2 merge them all together
3 get one big, massive, virtual server

Hadoop: core functionality
HDFS: Self-healing, high-bandwidth, clustered storage.
MapReduce: Distributed, fault-tolerant resource management, coupled with scalable data processing.

HDFS architecture

MapReduce
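As a rough illustration of the map and reduce programming model, the sketch below counts words in plain Python. It runs in a single process and does not use the Hadoop API; the function names map_phase, shuffle, reduce_phase and word_count, as well as the sample input, are made up for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map over every split, group by key, then reduce each group."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


splits = ["big data is big", "data about data"]
print(word_count(splits))
# {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

On a cluster, each split would be mapped on the machine that stores it, and only the grouped intermediate pairs would travel over the network to the reducers.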

Apache Hadoop essentials: technology stack

Pig
MapReduce requires programmers who
think in terms of map and reduce functions,
more than likely use the Java language.
Pig provides a high-level language (Pig Latin) that can also be used by
analysts,
data scientists,
statisticians,
etc.

Hive
Originated at Facebook to analyze log data.
HiveQL: Hive Query Language, similar to standard SQL.
Queries are compiled into MapReduce jobs.
Has a command-line shell, similar to e.g. the MySQL shell.
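To give a feel for the SQL-like style of query that HiveQL offers for log analysis, the sketch below runs an aggregate query over a few hypothetical log rows. It uses Python's built-in sqlite3 module purely as a local stand-in: the table name access_log, its columns, and the rows are invented, and a real Hive deployment would compile a comparable statement into MapReduce jobs instead of running it in-process.

```python
import sqlite3

# Stand-in only: SQLite executes this query in-process, whereas Hive would
# turn a similar HiveQL statement into MapReduce jobs on a cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access_log (url TEXT, status INTEGER)")
conn.executemany(
    "INSERT INTO access_log VALUES (?, ?)",
    [("/home", 200), ("/home", 200), ("/home", 404), ("/login", 500)],
)

# Count hits per URL, most requested first.
QUERY = """
    SELECT url, COUNT(*) AS hits
    FROM access_log
    GROUP BY url
    ORDER BY hits DESC, url
"""
rows = conn.execute(QUERY).fetchall()
print(rows)
# [('/home', 3), ('/login', 1)]
```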

Example Hadoop distributions

NoSQL

RDBMS: Codd’s 12 rules
Codd’s 12 rules
A set of rules designed to define what is required from a database management system in order for it to be considered relational.
Rule 0 The Foundation rule
Rule 1 The Information rule
Rule 2 The guaranteed access rule
Rule 3 Systematic treatment of null values
Rule 4 Active online catalog based on the relational model
...

ACID
A set of properties that guarantee that database transactions are processed reliably.
Atomicity: A transaction is all or nothing.
Consistency: Only transactions with valid data are committed.
Isolation: Simultaneous transactions will not interfere with each other.
Durability: Written transaction data stays there “forever” (even in case of power loss, crashes, errors, ...).
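The atomicity and consistency properties can be observed with a small experiment. The following sketch uses Python's built-in sqlite3 module with an in-memory database; the account table, the CHECK constraint, and the helper names transfer and balances are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # validity rule: no negative balance
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM account ORDER BY name"))


print(transfer(conn, "alice", "bob", 500), balances(conn))
# False {'alice': 100, 'bob': 50}
print(transfer(conn, "alice", "bob", 30), balances(conn))
# True {'alice': 70, 'bob': 80}
```

In the failed transfer, the credit to bob has already been applied when the debit violates the constraint; the rollback undoes it, so no partial transaction is ever visible.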

Scaling up
What if you need to scale up your RDBMS in terms of
dataset size,
read/write concurrency?
This usually involves
breaking Codd’s rules,
loosening ACID restrictions,
forgetting conventional DBA wisdom,
losing most of the desirable properties that made RDBMS so convenient in the first place.
NoSQL to the rescue!

NoSQL
Term coined by Carlo Strozzi in 1998 (for his file-based database)
“Not only SQL”
It’s NOT about
saying that SQL should never be used,
saying that SQL is dead.

NoSQL databases
Four emerging NoSQL categories:

Use the right tool for the right job!
http://db-engines.com/

Big Data in my company?

Typical RDBMS scaling story
1. Initial public launch
From local workstation → remotely hosted MySQL instance.
2. Service popularity ↑, too many reads hitting the database
Add memcached to cache common queries. Reads are now no longer strictly ACID; cached data must expire.
3. Popularity ↑↑, too many writes hitting the database
Scale MySQL vertically by buying a beefed-up server (costly):
16 cores
128 GB of RAM
banks of 15k RPM hard drives
4. New features → query complexity ↑, now too many joins
Denormalize your data to reduce joins.
(That’s not what they taught me in DBA school!)
5. Rising popularity swamps the server; things are too slow
Stop doing any server-side computations.
6. Some queries are still too slow
Periodically prematerialize the most complex queries, and try to stop joining in most cases.
7. Reads are OK, writes are getting slower and slower...
Drop secondary indexes and triggers (no indexes?).

If you stay up at night worrying about your database (uptime, scale, or speed), you should seriously consider making a jump from the RDBMS world to HBase.
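Step 2 of the story puts a cache with expiry in front of the database. The sketch below shows the idea with a small in-process cache in Python; it is a stand-in and not the memcached API. The class name TTLCache, the method get_or_load, the function slow_query, and the fake clock are all hypothetical.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a memcached-style cache with expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expiry_time, value)

    def get_or_load(self, key, loader):
        """Return the cached value, or call `loader` and cache its result."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: may be stale relative to the database
        value = loader(key)  # cache miss or expired entry: query the database
        self._store[key] = (now + self.ttl, value)
        return value


# Demo with a fake clock so that expiry can be shown without waiting.
fake_now = [0.0]
db_reads = []


def slow_query(key):
    """Pretend database read; records every call that reaches the database."""
    db_reads.append(key)
    return f"row for {key}"


cache = TTLCache(ttl_seconds=30, clock=lambda: fake_now[0])
cache.get_or_load("user:1", slow_query)  # miss: reads the database
cache.get_or_load("user:1", slow_query)  # hit: served from the cache
fake_now[0] = 31.0
cache.get_or_load("user:1", slow_query)  # expired: reads the database again
print(len(db_reads))  # 2
```

Between a write to the database and the expiry of the cached entry, readers can see an outdated value, which is the sense in which reads are no longer strictly ACID.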

Use-cases of Big Data
‘Core Big Data’ company
Big Data
crunching,
hacking,
processing,
analyzing,
...
‘General Big Data’ company
Business Analytics
improve decision-making,
gain operational insights,
increase overall performance,
track and analyze shopping patterns,
...
Both
Explore! Discover hidden gems!

Some examples
Intrusion detection based on server log data
Real-time security analytics
Fraud detection
Customer behavior-based sentiment analysis of social media
Campaign analytics

Big Data in your company

IWT TETRA project

IWT TETRA project
Data mining: from relational database to Big Data (original Dutch title: “Data mining: van relationele database naar Big Data”).
Dates
Submitted: 12/03/2014
Notification of acceptance: July 2014
Runs from 01/10/2014 to 01/10/2016
People involved
Wannes De Smet (researcher)
Bart Vandewoestyne (researcher)
Johan De Gelas (project coordinator)
Interested? → Come talk to us!

Project plan, work packages
[Figure: project plan diagram. Labels: RDBMS vs. Distributed Processing, Technology Choice, MapReduce & Alternatives, Big Data Stack Analysis, BI Optimization, Distributed Processing Optimization, Infrastructure & Cloud Analysis, Dissemination.]

WP1: RDBMS vs. Distributed Processing
Key question
When to switch from a ‘traditional’ technology to ‘Big Data’ technology?
Evaluate traditional database systems (Virtuoso, VoltDB, ...).
Find their limitations.
Strengths? Weaknesses?

WP2: Analyse Big Data technology stack
Key idea
Get acquainted with Hadoop and its most important software components.
Find the best way to set up, administer and use Hadoop.
Get familiar with the most important software components (Pig, Hive, HBase, ...).
Find out how easy it is to integrate Hadoop into existing architectures.

WP3: Alternatives for MapReduce
Key question
What are valuable alternatives for MapReduce?
Faster querying (compared to Pig & Hive)
Lightning-fast cluster computing
Distributed and fault-tolerant realtime computation (Apache Storm)

WP4: BI optimization
Key questions
Where can existing BI solutions be optimized?
How can current BI solutions interact with Big Data technology?
Virtuoso, MS SQL Server 2014, VoltDB, ...
Apache Sqoop

WP5: Distributed Processing optimization
Key question
Where can Big Data technology be performance tuned?
How is the data stored?
Optimal settings for Hadoop, MapReduce, ...
Benchmarks such as TestDFSIO, TeraSort, NNBench, MRBench, ...

WP6: Infrastructure & Cloud analysis
Key question
What hardware best fits the (Big Data) needs?
Perform hardware monitoring.
Analyze cloud solutions.
Formulate best practices.
Give advice on hardware choice.

WP7: Dissemination & project follow-up
Key idea
Spread the message!
Document case studies.
Prepare for education.
Presentations at events.
Blogs, articles, ...
Workshops

Conclusions

Conclusions
“Big” can be small too.
The Big Data landscape is huge.
The right tool for the right job!
We can help → advice, case studies
Your company can benefit from Big Data technology.
Be brave in your quest...

Questions?
johan@sizingservers.be
wannes@sizingservers.be
bart@sizingservers.be
