Big Data: what is it? What opportunities does it offer? How can we use it? Find the answers in this presentation delivered by Pentalog at the ALT Festival event, organized by the ALT Brasov Cluster for Innovation and Technology.
http://www.altbrasov.org/
Nov 2014 – Big Data (53 slides)
What is Big Data ?
* Data so large and complex that it becomes difficult to
process with traditional systems
* First coined in 1997, in a NASA report
* Petabytes and Exabytes of data
Big data is everywhere
* Every 2 days we create as much information as we did from the beginning of time
until 2003
* Google processes over 40 thousand search queries per second, making it
over 3.5 billion in a single day.
* Around 100 hours of video are uploaded to YouTube every minute and it
would take you around 15 years to watch every video uploaded by users in one
day
* Every minute we send 204 million emails, generate 1.8 million Facebook
likes, send 278 thousand Tweets, and upload 200,000 photos to Facebook
* Trillions of sensors monitor, track, and communicate with each other, populating
the IoT with real-time data
Big data is not new
Characteristics
Volume
* More data == better models
* Scalable storage and a distributed approach to querying
Variety
* Big data includes all data
* Data no longer fits into neatly structured tables
Velocity
* Frequency at which data is generated, captured, stored and processed
* Need for real-time processing
Data sources
Importance of Big Data
* Media
* Retailing
* Public service
* Health
* Industry
Importance of Big Data
* Gaining a more complete understanding of
→ business
→ customers
→ products
→ competitors
* Which can lead to
→ efficiency improvements
→ increased sales
→ lower costs
→ better customer service
→ improved products
The problem
* Of the overall information available
→ 10% is structured data, used in decision making
→ 90% is unstructured data: wasted, not captured or analyzed
* Valuable information vs. data which is best left ignored
* 37.5% of large organizations said that analyzing big data is their biggest challenge
* More than 90% said that Big Data is a top-ten priority
It’s not only about the size
* Collect -> Analyze -> Understand -> Generate Value
* Find meaning
* Find interconnections
* Find hidden data
Purpose
* Take more precise actions that bring value and reduce costs
* Make the right decision within the right amount of time
How big will big data get?
* From 3.2 zettabytes today to 40 zettabytes in only six years
* More than 30 billion devices will be wirelessly connected by 2020.
Challenges
* Storing data
* Analysis
* Search
* Sharing
* Transfer
* Visualization
NoSQL and Big Data Analytics
* Storing data
* Distribution
* Processing
NoSQL
* Scalability / cluster-friendly
* Availability / fault tolerance
* Schema-less
* Low latency
* High performance
* Open-source
Dynamic scaling
* adding/removing nodes dynamically
→ storage/performance capacity can grow or shrink as needed
Auto-sharding
* Natively and automatically spread data across servers
* Data and query load automatically balanced across servers
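The core idea can be sketched in a few lines. The snippet below is a hypothetical illustration of hash-based shard placement; the node names and the `shard_for` helper are invented for the example, not any particular database's API:

```python
import hashlib

# Hypothetical node list; real systems discover and rebalance nodes dynamically.
NODES = ["node-a", "node-b", "node-c"]

def shard_for(key: str) -> str:
    """Map a record key to a node with a stable hash (illustrative only)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Each key deterministically lands on one node, spreading data and
# query load across the servers without manual placement.
placement = {k: shard_for(k) for k in ["user:1", "user:2", "user:3"]}
```

Production systems typically use consistent hashing or range partitioning instead of a plain modulo, so that adding or removing a node does not remap most keys.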
Replication
* Support automatic replication
→ high availability
→ disaster recovery
→ no need for separate applications to manage these tasks
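A minimal sketch of the idea, assuming a replication factor of 3 (the same default Hadoop uses) and an invented toy `Node` class; this is not any specific database's replication protocol:

```python
REPLICATION_FACTOR = 3  # assumed default, mirroring HDFS's factor of 3

class Node:
    """Toy storage node holding an in-memory key/value store."""
    def __init__(self, name):
        self.name = name
        self.store = {}

def replicated_write(nodes, key, value, rf=REPLICATION_FACTOR):
    """Write the value to the first `rf` nodes; each holds a full copy."""
    targets = nodes[:rf]
    for node in targets:
        node.store[key] = value
    return [n.name for n in targets]

cluster = [Node(f"n{i}") for i in range(4)]
replicas = replicated_write(cluster, "user:1", {"name": "Ana"})
# If any single replica fails, the value is still readable from the others.
```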
Schemaless
* No predefined schema
* Insertion of aggregates
→ puts together data that is commonly accessed together
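As an illustration, an order aggregate in a document store might nest the line items that are usually read together with it. The structure below is a made-up example, not a fixed schema:

```python
# One self-contained aggregate: customer snapshot and line items together.
order = {
    "_id": "order-1001",
    "customer": {"name": "Ana", "city": "Brasov"},
    "items": [
        {"sku": "ABC", "qty": 2, "price": 9.99},
        {"sku": "XYZ", "qty": 1, "price": 24.50},
    ],
}

# Schemaless: a later document may simply carry extra (or fewer) fields.
order2 = {"_id": "order-1002", "customer": {"name": "Dan"}, "items": [],
          "coupon": "ALT2014"}

def order_total(doc):
    """Everything needed for the total lives inside the one aggregate."""
    return sum(item["qty"] * item["price"] for item in doc["items"])
```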
NoSQL flavors
* Key-value store
→ Amazon DynamoDB, Redis
→ Content caching (focus on scaling to huge amounts of data, designed to handle massive load), logging, etc.
* Document store
→ CouchDB, MongoDB
→ Web applications
* Column family store
→ Cassandra, HBase
→ Distributed file systems
* Graph store
→ Neo4j, InfoGrid, InfiniteGraph
→ Social networking, recommendations (focus on modeling the structure of data – interconnectivity)
Reasons for choosing NoSQL
* Working on large amounts of data
* Scaling out with ease
* Need for:
→ high availability
→ low-latency systems with eventual consistency
* The aggregate model fits:
→ as a natural choice
→ when the structure changes over time
… and associates
What is Hadoop?
● Distributed file system
● Distributed processing system
● Batch / offline oriented
● Open source
In the beginning...
● Created by Doug Cutting and Mike Cafarella
● Intended as distribution support for the Nutch search engine
● Built on Google's MapReduce and Google File System papers
Who uses Hadoop?
Most notable users are …
+ many others
Hadoop in the real world
● Recommendation systems
● Data warehousing
● Financial analysis
● Market research/forecasting
● Log analysis
● Threat analysis
● Image processing
● Social networking
● Advertising
Why Hadoop?
● Scalable
● Cost effective
● Flexible
● Efficient
● Resilient to failure
● Schema on read
Why not Hadoop?
● Inefficient when used at small scale
● Not good for real-time systems
Hadoop major components
● Hadoop commons
● YARN
● HDFS
● Map/Reduce
Architecture
MapReduce
● Split input files
● Operate on key/value pairs
● Mappers filter & transform input data
● Reducers aggregate mapper output
● Move code to data
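The classic word-count example shows the model. The sketch below is a single-process Python simulation of the map, shuffle/sort, and reduce phases, not Hadoop code:

```python
from itertools import groupby

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Reduce phase: aggregate all values emitted for one key."""
    return (word, sum(counts))

def map_reduce(lines):
    pairs = [kv for line in lines for kv in mapper(line)]
    pairs.sort(key=lambda kv: kv[0])  # the shuffle/sort phase groups by key
    return dict(reducer(key, (v for _, v in group))
                for key, group in groupby(pairs, key=lambda kv: kv[0]))

result = map_reduce(["big data is big", "data beats models"])
# result == {"beats": 1, "big": 2, "data": 2, "is": 1, "models": 1}
```

In Hadoop the same mapper and reducer run in parallel on splits of an HDFS file, and the framework performs the shuffle/sort across the network.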
… and associates
Apache Ambari
The project is aimed at making Hadoop management simpler
by developing software for provisioning, managing,
and monitoring Apache Hadoop clusters
Apache Pig
Apache Pig is a platform for analyzing large data sets that consists of a high-level
language for expressing data analysis programs, coupled with infrastructure
for evaluating these programs
Apache Hive
The Apache Hive ™ data warehouse software facilitates querying and
managing large datasets residing in distributed storage. Hive provides a mechanism
to project structure onto this data and query the data using a SQL-like language called HiveQL
Apache Chukwa
It is a data collection system for monitoring large distributed systems.
Chukwa comes with a flexible and powerful toolkit for displaying, monitoring and analyzing
results to make the best use of the collected data.
Apache Avro
A remote procedure call and data serialization framework
Apache HBase
Apache HBase offers random, realtime read/write access to your Big Data.
This project's goal is the hosting of very large tables
-- billions of rows × millions of columns -- atop clusters of commodity hardware
Apache Mahout
The Apache Mahout™ project's goal is to build a scalable machine learning library
Apache Spark
Apache Spark™ is a fast and general engine for large-scale data processing
Apache Zookeeper
ZooKeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and providing group services.
Big data – in the future
● 87% of enterprises believe Big Data analytics will redefine the competitive landscape
of their industries within the next three years
● 89% believe that companies that do not adopt a Big Data analytics strategy in the next
year risk losing market share and momentum.
Speaker notes
1880 – US census
Big Data is relative:
- relative to the system
- relative to the organization
The fact that we manage to generate and store so much information is also due to the evolution of storage devices, in terms of the storage-capacity/price ratio
Characteristics...
Volume...
The main attraction of Big Data
A large volume of data can, once analyzed, yield better behavioral patterns/models.
Weather forecasting: 6 vs. 300 factors (images, satellite logs + sensors for temperature and for air and water pressure, etc.)
Distribution
Variety...
Data is generated in dozens of formats: audio, video, logs, GPS coordinates, documents, SMS, emails.
We have no control over the types of data used as input
We cannot impose a structure on the data in order to control the analysis
Velocity...
F1 → sensors → TB of info → real-time processing → setup adjustments
Retailers → fast clickstream analysis → recommendations
Data sources...
Users – data generators: searches, clickstreams
Social-network addiction and the rise/spread of smartphones
Public web – IMDb, Wikipedia, organizations that publish large data sets from various domains
Archives
Systems + sensors – log generation / the biggest data producers
Each source is characterized by the 3 Vs
Importance... examples...
Media – Netflix → producing its own shows
Public service – courier fleet → lower maintenance costs
Health – hospital network → patterns → evolution of a disease / of health status as a function of various parameters
The problem...
Generating value...
Purpose...
How big?...
25,000 machines
more than 10 clusters
3 petabytes of data (compressed, unreplicated)
default replication in Hadoop is 3
700+ users
10,000+ jobs/week
In 2010 Facebook claimed that they had the largest Hadoop cluster in the world, with 21 PB of storage. On June 13, 2012 they announced the data had grown to 100 PB. On November 8, 2012 they announced the data gathered in the warehouse grows by roughly half a PB per day.
Hadoop commons
CLI MiniCluster
Native libraries
Security
HDFS
Data is distributed and replicated over multiple machines
Designed for large files (where "large" means GB to TB)
Block oriented
Linux-style commands
e.g. ls, cp, mv, rm, etc.
Replication
Replica placement
Replica selection
Rack awareness
Safemode
Robustness
Rebalancer
Heartbeats
Data integrity
Persistence of metadata
Hadoop Common – contains libraries and utilities needed by other Hadoop modules.
Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications.
Hadoop MapReduce – a programming model for large scale data processing
Speed
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk