Big Data Analytics: Architectural Perspective

Sumit Kalra
Advisor: Prof. TV Prabhakar
Ph.D. State of The Art Seminar
Department of Computer Science and Engineering
Indian Institute of Technology, Kanpur

Outline
1 Motivation
2 Software Architecture and Big Data Analytics
3 Software Architecture in Big Data Analytics
4 State of The Art Frameworks/Systems
5 Key Approaches
6 Open Challenges
7 References
8 Appendix

Data Explosion
Figure: Exponential growth of digital data [1]
[1] Todd Lindeman and Brian Vastag. "Rise of the digital information age". (2007). url:
http://www.washingtonpost.com/.

One Internet Minute
Figure: Data transactions in one minute on the Internet [2]
[2] Intel Embedded Developers and Engineers. "What Happens In An Internet Minute?" In: Intel Communications
(2011).

Mining Data Mountain
Figure: Mining information from raw data [3]
[3] JIA Mom. "Escape from Medical Bill Mountain." In: JIA Mom's Blog (2013).

Outline for Section 2
Software Architecture Preliminaries
Data Analytics Preliminaries
'Big' Data
Architectural Perspective
Big Data Eco-System
Parallel and Distributed Systems

Definition: Software Architecture
Software Architecture
The software architecture of a program or computing system is the
structure or structures of the system, which comprise software elements,
the externally visible properties of those elements, and the relationships
among them. [a]
[a] Len Bass, Paul Clements, and Rick Kazman. Software Architecture in Practice. 2nd ed. Boston, MA, USA:
Addison-Wesley Longman Publishing Co., Inc., 2003. isbn: 0321154959.

Example: Software Architecture
Figure: Linux kernel map [4]
[4] Conan. "Linux kernel". In: English Wikipedia. url:
http://en.wikipedia.org/wiki/File:Linux_kernel_map.png.

Data Analytics 1 2 3
Figure: Analytics 1.0, 2.0 and 3.0 [5]
[5] Thomas H. Davenport. "The Rise of Analytics 3.0 - How to Compete in the Data Economy". In: International
Institute for Analytics (2013).

'Big' Data
Data that does not fit into memory.
Large and complex data sets which are difficult to process with
existing data management tools.
3 Vs of Big Data
According to Gartner's research report [a], the 3 attributes of big data are:
Volume
Variety
Velocity
[a] Doug Laney. "3D Data Management: Controlling Data Volume, Velocity and Variety". In: Research Report,
META Group (now Gartner) (2001).

Architecture and Data

Data Oriented System Architecture 1/2
Scalability
Master-slave architecture
Peer-to-peer deployment
Cluster deployment
At the cost of management overhead
Performance
Over the last decade:
Sequential I/O throughput of commodity SATA drives has increased
from 20 MB/sec to 60-80 MB/sec
Disk size has increased by a factor of 1,000
Reading a terabyte of data at this rate takes about 4.5 hours
To increase throughput, use a larger number of smaller disks and read from
them in parallel
At the cost of increased power consumption
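
The read-time figure on this slide can be checked with a short calculation. The sketch below assumes a decimal terabyte (10^12 bytes) and data striped evenly across disks; the disk counts (10 and 100) and the function name scan_hours are illustrative choices, not values from the slides. At 60 MB/sec a single disk needs about 4.6 hours, close to the figure quoted above, and at 80 MB/sec about 3.5 hours.

```python
# Back-of-the-envelope time for one full sequential scan.
# The 60-80 MB/sec range is taken from the slide; the terabyte definition
# and the disk counts are assumptions made for illustration.

TERABYTE = 10**12  # bytes (decimal terabyte, assumed)
MEGABYTE = 10**6   # bytes


def scan_hours(size_bytes: int, mb_per_sec: float, disks: int = 1) -> float:
    """Hours to read size_bytes striped evenly over `disks` drives,
    each read in parallel at mb_per_sec."""
    seconds = size_bytes / (mb_per_sec * MEGABYTE * disks)
    return seconds / 3600


if __name__ == "__main__":
    for rate in (60, 80):
        print(f"1 disk at {rate} MB/sec: {scan_hours(TERABYTE, rate):.1f} h")
    for disks in (10, 100):
        minutes = scan_hours(TERABYTE, 60, disks) * 60
        print(f"{disks} disks at 60 MB/sec: {minutes:.1f} min")
```

Under these assumptions the scan time falls linearly with the number of disks, which is the motivation for reading from many smaller disks in parallel.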

Figure: CAP Theorem [6]
[6] Seth Gilbert and Nancy Lynch. "Brewer's Conjecture and the Feasibility of Consistent, Available,
Partition-Tolerant Web Services". In: ACM SIGACT News (2002).

Big Data Eco-System
Figure: Big Data Eco-System Components

Inherent Issues in Big Data Systems
Data is not centralized, memory-resident, or static
Bandwidth is limited, and data is generated too fast
In a distributed system, each site has only a limited view of the entire system
Data is a mix of unstructured, semi-structured, and structured formats

Categorization of Parallel and Distributed Systems
Shared-Memory systems
Distributed-Memory systems
Hierarchical systems
Paradigms for Parallelism
Data Parallelism: Partitioned database
Task Parallelism: Replicated database
Hybrid Parallelism: Large-scale database with a huge search space
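
The difference between the first two paradigms can be sketched in a few lines. This is a minimal illustration, not a description of any particular system: the toy data set, the helper names, and the use of a thread pool on one machine are all assumptions. In a real deployment the workers would be separate processes or nodes.

```python
from concurrent.futures import ThreadPoolExecutor

records = list(range(1, 101))  # toy "database" (assumed data)


def partition(data, n):
    """Split data into at most n contiguous partitions of similar size."""
    size = max(1, -(-len(data) // n))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def data_parallel_sum(data, workers=4):
    """Data parallelism: each worker runs the SAME task on its OWN partition."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, partition(data, workers)))
    return sum(partial_sums)


def task_parallel_stats(data):
    """Task parallelism: each worker runs a DIFFERENT task on a full copy."""
    tasks = {"min": min, "max": max, "sum": sum}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(fn, list(data)) for name, fn in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    print(data_parallel_sum(records))
    print(task_parallel_stats(records))
```

A hybrid scheme would combine the two, for example by running several different tasks, each of which is itself split over partitions.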

Outline for Section 3
In-Memory Data Abstraction
Execution Plan Reconfiguration at Runtime

In-Memory Data Abstraction 1/2
Figure: Apache Spark Streaming Architecture [7]
[7] Matei Zaharia et al. "Spark: Cluster Computing with Working Sets". In: Proceedings of the 2nd USENIX
Conference on Hot Topics in Cloud Computing. HotCloud'10. Boston, MA: USENIX Association, 2010, pp. 10-10.
url: http://dl.acm.org/citation.cfm?id=1863103.1863113.

In-Memory Data Abstraction 2/2
Figure: Apache Spark Performance Comparison [8]
[8] Matei Zaharia et al. "Spark: Cluster Computing with Working Sets". In: Proceedings of the 2nd USENIX
Conference on Hot Topics in Cloud Computing. HotCloud'10. Boston, MA: USENIX Association, 2010, pp. 10-10.
url: http://dl.acm.org/citation.cfm?id=1863103.1863113.
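
The idea behind an in-memory data abstraction can be sketched with a toy class. This is not Spark's API: InMemoryDataset, its methods, and the loader function are hypothetical names chosen for illustration. The sketch assumes that each dataset remembers how it was derived from its parent, so a cached result can be shared by several tasks and rebuilt if the cached copy is lost.

```python
class InMemoryDataset:
    """Toy in-memory dataset (hypothetical, for illustration only)."""

    def __init__(self, source, transform=None):
        self._source = source        # parent dataset, or a zero-argument loader
        self._transform = transform  # how to derive this dataset from its parent
        self._cache = None           # materialized items, kept in memory
        self.computations = 0        # how many times the items were computed

    def map(self, fn):
        return InMemoryDataset(self, lambda items: [fn(x) for x in items])

    def filter(self, pred):
        return InMemoryDataset(self, lambda items: [x for x in items if pred(x)])

    def collect(self):
        """Return the items, computing them only if no cached copy exists."""
        if self._cache is None:
            self.computations += 1
            if self._transform is None:
                self._cache = list(self._source())  # root: simulated disk read
            else:
                self._cache = self._transform(self._source.collect())
        return self._cache

    def evict(self):
        """Simulate losing the cached copy, for example after a node failure."""
        self._cache = None


if __name__ == "__main__":
    disk_reads = []

    def load_from_disk():
        disk_reads.append(1)  # count simulated disk reads
        return range(10)

    evens = InMemoryDataset(load_from_disk).filter(lambda x: x % 2 == 0)
    print(sum(evens.collect()))  # first task: computes and caches
    print(len(evens.collect()))  # second task: served from memory
    evens.evict()
    print(evens.collect())       # rebuilt from the recorded transformation
    print(len(disk_reads))       # the parent was still cached, so one read
```

The second task reuses the cached items without touching the simulated disk, which is the effect the performance comparison on this slide attributes to keeping working sets in memory.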

Execution Plan Reconfiguration at Runtime 1/2
Figure: Apache Tez Plan Reconfiguration at Runtime [9]
[9] Bikas Saha. "Apache Tez: A New Chapter in Hadoop Data Processing". In: Technical blog, Hortonworks (2013).

Execution Plan Reconfiguration at Runtime 2/2
Figure: Apache Tez Performance Comparison [10]
[10] Bikas Saha. "Apache Tez: A New Chapter in Hadoop Data Processing". In: Technical blog, Hortonworks (2013).

Outline for Section 4
Timeline
Categorization
Processing Time Based Categorization
Programming Paradigm Based Categorization
Service Model Based Categorization
Storage Architecture Based Categorization

Big Data Analytics Tool Timeline
Figure: Timeline

Big Data Analytics Tool Categorization
Figure: CMAP - Tool Categorization

Processing Time
Figure: Big Data Analytics: Processing Time

Programming Model
Figure: Big Data Analytics: Programming Paradigm

Service Model
Figure: Big Data Analytics: Service Model

Storage Architecture
Figure: Big Data Analytics: Storage Architecture

Key Approaches
In-Memory Data Abstraction
Minimize disk I/O
Share data residing in memory between multiple tasks
Partial DAG Execution
Dynamic execution flow configuration
Dynamic alteration of query plans based on data statistics collected
at run time
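
Altering a query plan from run-time statistics can be sketched as follows. The example assumes a two-table join whose strategy is picked only after the upstream filter stages have run and reported their output sizes. The row-count threshold, the two strategies, and all function names are assumptions made for illustration; they are not taken from Tez or any other system.

```python
from collections import defaultdict

BROADCAST_LIMIT = 3  # rows; assumed threshold (real systems typically use bytes)


def run_stage(rows, pred):
    """Run an upstream stage and collect statistics about its output."""
    out = [r for r in rows if pred(r)]
    return out, {"rows": len(out)}


def hash_join(build, probe):
    """Join lists of (key, value); yields (key, build_value, probe_value)."""
    table = defaultdict(list)
    for key, value in build:
        table[key].append(value)
    return [(k, bv, pv) for k, pv in probe for bv in table.get(k, [])]


def shuffle_join(left, right, partitions=2):
    """Repartition both sides by key, then join matching partitions."""
    parts_l, parts_r = defaultdict(list), defaultdict(list)
    for k, v in left:
        parts_l[hash(k) % partitions].append((k, v))
    for k, v in right:
        parts_r[hash(k) % partitions].append((k, v))
    out = []
    for p in range(partitions):
        out.extend(hash_join(parts_l[p], parts_r[p]))
    return out


def choose_join(left_stats, right_stats):
    """Decide the join strategy from statistics observed at run time."""
    if min(left_stats["rows"], right_stats["rows"]) <= BROADCAST_LIMIT:
        return "broadcast"
    return "shuffle"


def execute(left, right, left_pred, right_pred):
    """Run both filter stages, then pick and run the join; rows are (key, left, right)."""
    left_rows, left_stats = run_stage(left, left_pred)
    right_rows, right_stats = run_stage(right, right_pred)
    strategy = choose_join(left_stats, right_stats)
    if strategy == "shuffle":
        result = shuffle_join(left_rows, right_rows)
    elif left_stats["rows"] <= right_stats["rows"]:
        result = hash_join(left_rows, right_rows)  # small left side is the build side
    else:
        swapped = hash_join(right_rows, left_rows)  # small right side is the build side
        result = [(k, lv, rv) for k, rv, lv in swapped]
    return strategy, sorted(result)
```

A static planner would have to guess the filter selectivity in advance; here the decision is deferred until the actual row counts are known.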

Key Approaches
Lineage Based Fault Recovery
Minimize the overhead of data versioning
Increase the reusability of in-memory data
Data Co-partitioning
Co-partition two tables that are frequently used together on a common key
'DISTRIBUTE BY' clause in the table creation statement (HDFS)
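
Why co-partitioning helps a join can be sketched as follows. The sketch assumes both tables are hash-partitioned with the same function and the same partition count, so matching keys always land in partitions with the same index and each pair can be joined locally. The crc32-based partitioner and all function names are illustrative assumptions, not the mechanism of any specific system.

```python
import zlib
from collections import defaultdict


def partition_of(key, n):
    """Stable hash partitioner (crc32 is an illustrative choice)."""
    return zlib.crc32(str(key).encode("utf-8")) % n


def co_partition(table, n):
    """Split a list of (key, value) rows into n partitions by key."""
    parts = [[] for _ in range(n)]
    for key, value in table:
        parts[partition_of(key, n)].append((key, value))
    return parts


def local_join(left_part, right_part):
    """Join one pair of partitions without reading any other partition."""
    index = defaultdict(list)
    for key, value in left_part:
        index[key].append(value)
    return [(key, lv, rv) for key, rv in right_part for lv in index.get(key, [])]


def co_partitioned_join(left, right, n=4):
    """Join two tables partition by partition; no rows move between partitions."""
    out = []
    for left_part, right_part in zip(co_partition(left, n), co_partition(right, n)):
        out.extend(local_join(left_part, right_part))
    return out


if __name__ == "__main__":
    users = [(1, "ann"), (2, "bob"), (3, "cat")]
    orders = [(1, "book"), (3, "mug"), (1, "pen")]
    print(sorted(co_partitioned_join(users, orders)))
```

Because each partition pair is joined independently, the join needs no shuffle of rows across the network at query time, at the price of fixing the partitioning key when the tables are created.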

Key Approaches
Unification of Streaming, Batch and Interactive Processing
Significant practical value for developers and analysts
Simple to debug a running computation
Hybrid Storage Architecture
Uses a combination of flash storage and traditional hard disk drives
Data used most often resides on the faster, high-performance flash drives
Data that just needs to be stored until needed resides on traditional
hard disks
Optimized for throughput and power efficiency
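
The hot/cold placement described for hybrid storage can be sketched with a toy two-tier store. The class name, the access-count policy, and the capacity parameter are all assumptions made for illustration; production tiering policies are considerably more involved.

```python
from collections import Counter


class HybridStore:
    """Toy two-tier store: small 'flash' tier, large 'disk' tier (hypothetical)."""

    def __init__(self, flash_capacity):
        self.flash_capacity = flash_capacity
        self.flash = {}        # frequently used data
        self.disk = {}         # everything else
        self.hits = Counter()  # access count per key

    def put(self, key, value):
        """New data lands on disk; it moves to flash only once it proves hot."""
        if key in self.flash:
            self.flash[key] = value
        else:
            self.disk[key] = value

    def get(self, key):
        """Read a value and, if it lives on disk, consider promoting it."""
        if key in self.flash:
            value = self.flash[key]
        else:
            value = self.disk[key]  # raises KeyError for unknown keys
        self.hits[key] += 1
        if key in self.disk:
            self._maybe_promote(key)
        return value

    def _maybe_promote(self, key):
        """Move key to flash if there is room or it is hotter than the coldest entry."""
        if len(self.flash) < self.flash_capacity:
            self.flash[key] = self.disk.pop(key)
            return
        if not self.flash:
            return  # zero-capacity flash tier: nothing to swap with
        coldest = min(self.flash, key=lambda k: self.hits[k])
        if self.hits[key] > self.hits[coldest]:
            self.disk[coldest] = self.flash.pop(coldest)
            self.flash[key] = self.disk.pop(key)
```

With a flash capacity of two, the two most frequently read keys end up on the flash tier and a rarely read key stays on disk until its access count overtakes the coldest flash entry.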
