2. No single standard definition…
"Big Data" is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…
Big Data, Definition
5. Big Data Era
- Creates over 30 billion pieces of content per day
- Stores 30 petabytes of data
- Produces over 90 million tweets per day
6. Log Files
- Log files contain data.
- Each banking transaction should be logged at different levels.
How much log data does a banking solution generate per day?
10. What is driving the Big Data industry?
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
13. Big Data Challenges
Problem: "Fat" servers imply high cost.
Solution: Use cheap commodity nodes instead.
Problem: A large number of cheap nodes implies frequent failures.
Solution: Leverage automatic fault tolerance.
14. Big Data Challenges
We need a new data-parallel programming model for clusters of commodity machines.
18. MapReduce
Published in 2004 by Google.
Popularized by the Apache Hadoop project.
Used by Yahoo!, Facebook, Twitter, Amazon, LinkedIn, and many other enterprises.
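To make the model concrete, here is a minimal sketch of the word-count job usually used to illustrate MapReduce, written against Hadoop's org.apache.hadoop.mapreduce API. The class names and the use of command-line arguments for the input/output paths are illustrative choices, not something from the slides.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework handles splitting the input, shuffling the (word, count) pairs to reducers, and rerunning failed tasks, which is exactly the automatic fault tolerance the previous slides call for.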
22. Hadoop Overview
Open source implementation of Google MapReduce and the Google File System (GFS).
First release in 2008 by Yahoo!
Wide adoption by Facebook, Twitter, Amazon, etc.
24. Everything Started by Searching
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.
27. Hadoop Distributed File System (HDFS) - 1
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
- "Very large" in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
28. Hadoop Distributed File System (HDFS) - 2
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
- HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. The time to read the whole dataset is more important than the latency in reading the first record.
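A minimal sketch of that write-once, read-many pattern using Hadoop's standard org.apache.hadoop.fs.FileSystem API. The path /user/demo/events.log and the sample records are made up for illustration; the configuration is picked up from the usual core-site.xml/hdfs-site.xml on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/user/demo/events.log");   // illustrative path

    // Write once: create the file and stream records into it.
    try (FSDataOutputStream out = fs.create(path)) {
      out.writeBytes("record-1\nrecord-2\n");
    }

    // Read many times: stream the whole file back (full-scan access pattern).
    try (FSDataInputStream in = fs.open(path)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }

    fs.close();
  }
}
```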
29. Hadoop Distributed File System (HDFS) - 3
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
- Node failure is expected on commodity hardware, and HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.
30. Where does HDFS not work well?
● Low-latency data access
● Lots of small files
● Multiple writers, arbitrary file modifications
32. HDFS Concepts - Blocks
Block sizes are typically 64 MB, 128 MB, or 256 MB.
If the seek time is around 10 ms and the transfer rate is 100 MB/s, then to make the seek time 1% of the transfer time, we need to make the block size around 100 MB.
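Spelling out the arithmetic behind that rule of thumb, using the figures quoted on the slide (10 ms seek, 100 MB/s transfer):

```latex
t_{\text{transfer}} = \frac{t_{\text{seek}}}{0.01} = \frac{10\,\text{ms}}{0.01} = 1\,\text{s}
\qquad\Rightarrow\qquad
\text{block size} \approx 100\,\text{MB/s} \times 1\,\text{s} = 100\,\text{MB}
```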
37. Machine Learning - 1
Mahout's goal is to build scalable machine learning libraries: core algorithms for clustering, classification, and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.
38. Machine Learning - 2
Mahout
can
be
used
as
a
recommender
engine
on
the
top
of
hadoop
clusters.
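As a flavour of what that looks like, here is a minimal sketch of a user-based recommender built with Mahout's Taste API (Mahout 0.x). The ratings.csv file (userID,itemID,preference per line), the user ID 42, and the neighbourhood size are illustrative assumptions. Note that this sketch runs in a single process; Mahout's distributed item-based recommender is submitted to the Hadoop cluster as a separate MapReduce job.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical ratings file: userID,itemID,preference per line.
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 5 item recommendations for user 42.
    List<RecommendedItem> items = recommender.recommend(42L, 5);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " -> " + item.getValue());
    }
  }
}
```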
39. Using Hadoop for
● ads and recommendations
● online travel
● processing mobile data
● energy savings and discovery
● infrastructure management
● image processing
● fraud detection
● IT security
● health care