Big Data: what is it? What opportunities does it offer? How can we use it? Find the answers in this presentation delivered by Pentalog at the ALT Festival event, organized by the ALT Brasov Cluster for Innovation and Technology.
http://www.altbrasov.org/
Nov 2014 – Big Data (53 slides)
What is Big Data ?
* Data so large and complex that it becomes difficult to
process with traditional systems
* First coined in 1997, in a NASA report
* Petabytes and Exabytes of data
Big data is everywhere
* Every 2 days we create as much information as we did from the beginning of time
until 2003
* Google processes over 40 thousand search queries per second, making it
over 3.5 billion in a single day.
* Around 100 hours of video are uploaded to YouTube every minute and it
would take you around 15 years to watch every video uploaded by users in one
day
* Every minute we send 204 million emails, generate 1.8 million Facebook
likes, send 278 thousand Tweets, and upload 200,000 photos to Facebook
* Trillions of sensors monitor, track, and communicate with each other, populating
the IoT with real-time data
Big data is not new
Characteristics
Volume
* More data == better models
* Scalable storage and a distributed approach to querying
Variety
* Big data includes all data
* Data no longer fits into neatly structured tables
Velocity
* Frequency at which data is generated, captured, stored and processed
* Need for real-time processing
Data sources
Importance of Big Data
* Media
* Retailing
* Public service
* Health
* Industry
Importance of Big Data
* Gaining a more complete understanding of
→ business
→ customers
→ products
→ competitors
* Which can lead to
→ efficiency improvements
→ increased sales
→ lower costs
→ better customer service
→ improved products
The problem
* Of the overall information available
→ 10% is structured data, used in decision making
→ 90% is unstructured data: wasted, not captured or analyzed
* Valuable information vs. data which is best left ignored
* 37.5% of large organizations said that analyzing big data is their biggest challenge
* More than 90% said that Big Data is a top-ten priority
It’s not only about the size
* Collect -> Analyze -> Understand -> Generate Value
* Find meaning
* Find interconnections
* Find hidden data
Purpose
* Take more precise actions that bring value and reduce costs
* Make the right decision within the right amount of time
How big will big data get?
* From 3.2 zettabytes today to 40 zettabytes in only six years
* More than 30 billion devices will be wirelessly connected by 2020.
Challenges
* Storing data
* Analysis
* Search
* Sharing
* Transfer
* Visualization
NoSQL and Big Data Analytics
* Storing data
* Distribution
* Processing
NoSQL
* Scalability / cluster-friendly
* Availability / fault tolerance
* Schema-less
* Low latency
* High performance
* Open-source
Dynamic scaling
* adding/removing nodes dynamically
→ storage/performance capacity can grow or shrink as needed
Auto-sharding
* Natively and automatically spread data across servers
* Data and query load automatically balanced across servers
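The core idea can be sketched in a few lines. The snippet below is a hypothetical illustration of hash-based shard placement; the node names and the `shard_for` helper are invented for the example, not any particular database's API:

```python
import hashlib

# Hypothetical node list; real systems discover and rebalance nodes dynamically.
NODES = ["node-a", "node-b", "node-c"]

def shard_for(key: str) -> str:
    """Map a record key to a node with a stable hash (illustrative only)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Each key deterministically lands on one node, spreading data and
# query load across the servers without manual placement.
placement = {k: shard_for(k) for k in ["user:1", "user:2", "user:3"]}
```

Production systems typically use consistent hashing or range partitioning instead of a plain modulo, so that adding or removing a node does not remap most keys.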
Replication
* Support automatic replication
→ high availability
→ disaster recovery
→ no need for separate applications to manage these tasks
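A minimal sketch of the idea, assuming a replication factor of 3 (the same default Hadoop uses) and an invented toy `Node` class; this is not any specific database's replication protocol:

```python
REPLICATION_FACTOR = 3  # assumed default, mirroring HDFS's factor of 3

class Node:
    """Toy storage node holding an in-memory key/value store."""
    def __init__(self, name):
        self.name = name
        self.store = {}

def replicated_write(nodes, key, value, rf=REPLICATION_FACTOR):
    """Write the value to the first `rf` nodes; each holds a full copy."""
    targets = nodes[:rf]
    for node in targets:
        node.store[key] = value
    return [n.name for n in targets]

cluster = [Node(f"n{i}") for i in range(4)]
replicas = replicated_write(cluster, "user:1", {"name": "Ana"})
# If any single replica fails, the value is still readable from the others.
```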
Schemaless
* No predefined schema
* Insertion of aggregates
→ puts together data that is commonly accessed together
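As an illustration, an order aggregate in a document store might nest the line items that are usually read together with it. The structure below is a made-up example, not a fixed schema:

```python
# One self-contained aggregate: customer snapshot and line items together.
order = {
    "_id": "order-1001",
    "customer": {"name": "Ana", "city": "Brasov"},
    "items": [
        {"sku": "ABC", "qty": 2, "price": 9.99},
        {"sku": "XYZ", "qty": 1, "price": 24.50},
    ],
}

# Schemaless: a later document may simply carry extra (or fewer) fields.
order2 = {"_id": "order-1002", "customer": {"name": "Dan"}, "items": [],
          "coupon": "ALT2014"}

def order_total(doc):
    """Everything needed for the total lives inside the one aggregate."""
    return sum(item["qty"] * item["price"] for item in doc["items"])
```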
NoSQL flavors
* Key-value store
→ Amazon DynamoDB, Redis
→ Content caching (focus on scaling to huge amounts of data, designed to handle massive load), logging, etc.
* Document store
→ CouchDB, MongoDB
→ Web applications
* Column family store
→ Cassandra, HBase
→ Distributed file systems
* Graph store
→ Neo4j, InfoGrid, InfiniteGraph
→ Social networking, recommendations (focus on modeling the structure of data – interconnectivity)
Reasons for choosing NoSQL
* Working on large amounts of data
* Scaling out with ease
* Need for:
→ high availability
→ low-latency systems with eventual consistency
* The aggregate model fits:
→ as a natural choice
→ when the structure changes over time
… and associates
What is Hadoop?
● Distributed file system
● Distributed processing system
● Batch / offline oriented
● Open source
In the beginning...
● Created by Doug Cutting and Mike Cafarella
● Intended as distribution support for the Nutch search engine
● Built on Google's MapReduce and Google File System papers
Who uses Hadoop?
Most notable users are …
+ many others
Hadoop in the real world
● Recommendation systems
● Data warehousing
● Financial analysis
● Market research/forecasting
● Log analysis
● Threat analysis
● Image processing
● Social networking
● Advertising
Why Hadoop?
● Scalable
● Cost effective
● Flexible
● Efficient
● Resilient to failure
● Schema on read
Why not Hadoop?
● Inefficient when used at small scale
● Not good for real-time systems
Hadoop major components
● Hadoop commons
● YARN
● HDFS
● Map/Reduce
Architecture
MapReduce
● Split input files
● Operate on key/value pairs
● Mappers filter & transform input data
● Reducers aggregate mapper output
● Move code to data
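The classic word-count example shows the model. The sketch below is a single-process Python simulation of the map, shuffle/sort, and reduce phases, not Hadoop code:

```python
from itertools import groupby

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Reduce phase: aggregate all values emitted for one key."""
    return (word, sum(counts))

def map_reduce(lines):
    pairs = [kv for line in lines for kv in mapper(line)]
    pairs.sort(key=lambda kv: kv[0])  # the shuffle/sort phase groups by key
    return dict(reducer(key, (v for _, v in group))
                for key, group in groupby(pairs, key=lambda kv: kv[0]))

result = map_reduce(["big data is big", "data beats models"])
# result == {"beats": 1, "big": 2, "data": 2, "is": 1, "models": 1}
```

In Hadoop the same mapper and reducer run in parallel on splits of an HDFS file, and the framework performs the shuffle/sort across the network.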
… and associates
Apache Ambari
The project is aimed at making Hadoop management simpler
by developing software for provisioning, managing,
and monitoring Apache Hadoop clusters
Apache Pig
Apache Pig is a platform for analyzing large data sets that consists of a high-level
language for expressing data analysis programs, coupled with infrastructure
for evaluating these programs
Apache Hive
The Apache Hive ™ data warehouse software facilitates querying and
managing large datasets residing in distributed storage. Hive provides a mechanism
to project structure onto this data and query the data using a SQL-like language called HiveQL
Apache Chukwa
It is a data collection system for monitoring large distributed systems.
Chukwa comes with a flexible and powerful toolkit for displaying, monitoring and analyzing
results to make the best use of the collected data.
Apache Avro
A remote procedure call and data serialization framework
Apache HBase
Apache HBase offers random, realtime read/write access to your Big Data.
This project's goal is the hosting of very large tables
-- billions of rows × millions of columns -- atop clusters of commodity hardware
Apache Mahout
The Apache Mahout™ project's goal is to build a scalable machine learning library
Apache Spark
Apache Spark™ is a fast and general engine for large-scale data processing
Apache Zookeeper
ZooKeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and providing group services.
Big data – in the future
● 87% of enterprises believe Big Data analytics will redefine the competitive landscape
of their industries within the next three years
● 89% believe that companies that do not adopt a Big Data analytics strategy in the next
year risk losing market share and momentum.
Speaker notes
1880 – US census
Big Data is relative:
- relative to the system
- relative to the organization
The fact that we manage to generate and store so much information is also due to the evolution of storage devices, in terms of the storage-capacity/price ratio
Characteristics...
Volume...
The main attraction of Big Data
A large volume of data can, once analyzed, yield better behavioral patterns/models.
Weather forecasting: 6 vs. 300 factors (images, satellite logs + sensors for temperature and for air and water pressure, etc.)
Distribution
Variety...
Data is generated in dozens of formats: audio, video, logs, GPS coordinates, documents, SMS, emails.
We have no control over the types of data used as input
We cannot impose a structure on the data in order to control the analysis
Velocity...
F1 → sensors → TB of info → real-time processing → setup adjustments
Retailers → fast clickstream analysis → recommendations
Data sources...
Users – data generators: searches, clickstreams
Social-network addiction and the rise/spread of smartphones
Public web – IMDb, Wikipedia, organizations that publish large data sets from various domains
Archives
Systems + sensors – log generation / the biggest data producers
Each source is characterized by the 3 Vs
Importance... examples...
Media – Netflix → producing its own shows
Public service – courier fleet → lower maintenance costs
Health – hospital network → patterns → evolution of a disease / of health status as a function of various parameters
The problem...
Generating value...
Purpose...
How big?...
25,000 machines
more than 10 clusters
3 petabytes of data (compressed, unreplicated)
default replication in Hadoop is 3
700+ users
10,000+ jobs/week
In 2010 Facebook claimed that they had the largest Hadoop cluster in the world, with 21 PB of storage. On June 13, 2012 they announced the data had grown to 100 PB. On November 8, 2012 they announced the data gathered in the warehouse grows by roughly half a PB per day.
Hadoop commons
CLI MiniCluster
Native libraries
Security
HDFS
Data is distributed and replicated over multiple machines
Designed for large files (where "large" means GB to TB)
Block oriented
Linux-style commands
e.g. ls, cp, mv, rm, etc.
Replication
Replica placement
Replica selection
Rack awareness
Safemode
Robustness
Rebalancer
Heartbeats
Data integrity
Persistence of metadata
Hadoop Common – contains libraries and utilities needed by other Hadoop modules.
Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications.
Hadoop MapReduce – a programming model for large scale data processing
Speed
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk