Big Data Analytics: Architectural Perspective
1. Big Data Analytics: Architectural Perspective
Sumit Kalra
Advisor: Prof. TV Prabhakar
Ph.D. State of The Art Seminar
Department of Computer Science and Engineering
Indian Institute of Technology, Kanpur
2. Outline
1 Motivation
2 Software Architecture and Big Data Analytics
3 Software Architecture in Big Data Analytics
4 State of The Art Frameworks/Systems
5 Key Approaches
6 Open Challenges
7 References
8 Appendix
4. Data Explosion
Figure: Exponential growth of digital data¹
¹ Todd Lindeman and Brian Vastag. Rise of the digital information age. In: (2007). url:
http://www.washingtonpost.com/.
5. One Internet Minute
Figure: Data transactions in one minute on the Internet²
² Intel Embedded Developers and Engineers. What Happens In An Internet Minute? In: Intel Communications
(2011).
6. Mining Data Mountain
Figure: Mining information from raw data³
³ JIA Mom. Escape from Medical Bill Mountain. In: JIA Mom's Blog (2013).
8. Outline for Section 2
1 Motivation
2 Software Architecture and Big Data Analytics
Software Architecture Preliminaries
Data Analytics Preliminaries
'Big' Data
Architectural Perspective
Big Data Eco-System
Parallel and Distributed Systems
3 Software Architecture in Big Data Analytics
4 State of The Art Frameworks/Systems
5 Key Approaches
6 Open Challenges
10. Definition: Software Architecture
Software Architecture
The software architecture of a program or computing system is the
structure or structures of the system, which comprise software elements,
the externally visible properties of those elements, and the relationships
among them.ᵃ
ᵃ Len Bass, Paul Clements, and Rick Kazman. Software Architecture in Practice. 2nd ed. Boston, MA, USA:
Addison-Wesley Longman Publishing Co., Inc., 2003. isbn: 0321154959.
11. Example: Software Architecture
Figure: Linux kernel map⁴
⁴ Conan. Linux kernel. In: English Wikipedia. url:
http://en.wikipedia.org/wiki/File:Linux_kernel_map.png.
13. Data Analytics 1 2 3
Figure: Analytics 1.0, 2.0 and 3.0⁵
⁵ Thomas H. Davenport. The Rise of Analytics 3.0 - How to Compete in the Data Economy. In: International
Institute for Analytics (2013).
16. 'Big' Data
Data that does not fit into memory.
Large and complex data sets which are difficult to process with
existing data management tools.
3 Vs of Big Data
According to Gartner's research reportᵃ, the 3 attributes of big data are:
Volume
Variety
Velocity
ᵃ Doug Laney. 3D Data Management: Controlling Data Volume, Velocity and Variety. In: Research Report,
META Group (now Gartner) (2001).
19. Data Oriented System Architecture 1/2
Scalability
Master-slave architecture
Peer-to-peer deployment
Cluster deployment
At the cost of management overhead
Performance
Over the last decade:
Sequential I/O throughput of commodity SATA drives has increased
to 60-80 MB/sec from 20 MB/sec
Disk size increased by a factor of 1,000.
Reading a terabyte of data at this rate takes roughly 4.5 hours.
To increase throughput, use a larger number of smaller disks and read from them in
parallel
At the cost of increased power consumption
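The scan-time figure above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes a decimal terabyte (10^12 bytes), the 60-80 MB/sec range quoted on the slide, and perfectly even striping across disks; the function name `scan_hours` is made up for this example.

```python
def scan_hours(data_bytes: float, mb_per_sec: float, disks: int = 1) -> float:
    """Hours needed to read data_bytes sequentially, striped evenly over `disks` drives."""
    bytes_per_sec = mb_per_sec * 1e6 * disks  # assumes ideal, linear scaling across disks
    return data_bytes / bytes_per_sec / 3600


TERABYTE = 1e12  # decimal terabyte (assumption)

for rate in (60, 80):
    print(f"{rate} MB/s, 1 disk: {scan_hours(TERABYTE, rate):.1f} h")

# Reading from 100 disks in parallel cuts the scan time by the same factor.
print(f"60 MB/s, 100 disks: {scan_hours(TERABYTE, 60, disks=100) * 60:.1f} min")
```

Under these assumptions a single drive needs between about 3.5 and 4.6 hours per terabyte, which is consistent with the figure on the slide, while 100 drives read in parallel bring the same scan down to a few minutes.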
26. Data Oriented System Architecture 2/2
Figure: CAP Theorem⁶
⁶ Seth Gilbert and Nancy Lynch. Brewer's Conjecture and the Feasibility of Consistent Available
Partition-Tolerant Web Services. In: ACM SIGACT News. 2002.
30. Inherent Issues in Big Data Systems
Data is not centralized, memory-resident, or static
Bandwidth is limited and data is generated too fast
In a distributed system, each site has only a limited view of the entire system
Data is a mix of unstructured, semi-structured and structured formats
32. Categorization of Parallel and Distributed Systems
Shared-Memory systems
Distributed-Memory systems
Hierarchical systems
Paradigms for Parallelism
Data Parallelism: Partitioned database
Task Parallelism: Replicated database
Hybrid Parallelism: Large scale database with huge search space
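The difference between the first two paradigms can be sketched in a few lines. This is a minimal illustration, not taken from the slides: the function names are hypothetical, and Python threads are used only to show the structure (in CPython they do not give real CPU parallelism for this kind of work).

```python
from concurrent.futures import ThreadPoolExecutor


def data_parallel_sum(values, workers=4):
    """Data parallelism: the same operation (sum) runs on disjoint partitions."""
    partitions = [values[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker sums its own partition; the partial sums are then combined.
        return sum(pool.map(sum, partitions))


def task_parallel_stats(values):
    """Task parallelism: different operations run concurrently on the same data."""
    tasks = {"min": min, "max": max, "count": len}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(fn, values) for name, fn in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


print(data_parallel_sum(list(range(100))))
print(task_parallel_stats([3, 1, 2]))
```

In the data-parallel case every worker sees only its own partition, which corresponds to a partitioned database. In the task-parallel case every task reads the full data set, which corresponds to a replicated one.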
35. Outline for Section 3
1 Motivation
2 Software Architecture and Big Data Analytics
3 Software Architecture in Big Data Analytics
In-Memory Data Abstraction
Execution Plan Reconfiguration at Runtime
4 State of The Art Frameworks/Systems
5 Key Approaches
6 Open Challenges
7 References
37. In-Memory Data Abstraction 1/2
Figure: Apache Spark Streaming Architecture⁷
⁷ Matei Zaharia et al. Spark: Cluster Computing with Working Sets. In: Proceedings of the 2nd USENIX
Conference on Hot Topics in Cloud Computing. HotCloud'10. Boston, MA: USENIX Association, 2010, pp. 10-10.
url: http://dl.acm.org/citation.cfm?id=1863103.1863113.
38. In-Memory Data Abstraction 2/2
Figure: Apache Spark Performance Comparison⁸
⁸ Matei Zaharia et al. Spark: Cluster Computing with Working Sets. In: Proceedings of the 2nd USENIX
Conference on Hot Topics in Cloud Computing. HotCloud'10. Boston, MA: USENIX Association, 2010, pp. 10-10.
url: http://dl.acm.org/citation.cfm?id=1863103.1863113.
45. Execution Plan Reconfiguration at Runtime 2/2
Figure: Apache Tez Performance Comparison¹⁰
¹⁰ Bikas Saha. Apache Tez: A New Chapter in Hadoop Data Processing. In: Technical blog, Hortonworks (2013).
47. Outline for Section 4
1 Motivation
2 Software Architecture and Big Data Analytics
3 Software Architecture in Big Data Analytics
4 State of The Art Frameworks/Systems
Timeline
Categorization
Processing Time Based Categorization
Programming Paradigm Based Categorization
Service Model Based Categorization
Storage Architecture Based Categorization
5 Key Approaches
6 Open Challenges
50. Big Data Analytics Tool Categorization
Figure: CMAP - Tool Categorization
60. Key Approaches
In-Memory Data Abstraction
Minimize Disk I/O
Share data residing in memory between multiple tasks
Partial DAG Execution
Dynamic execution flow configuration
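The in-memory data abstraction idea (read from disk once, then let several tasks share the memory-resident copy) can be sketched as follows. This is a toy illustration, not the API of any framework named in these slides; `InMemoryDataset` and `slow_disk_loader` are hypothetical names, and the "disk" is simulated by a plain function.

```python
class InMemoryDataset:
    """Loads records from a (simulated) disk once and serves later tasks from memory."""

    def __init__(self, loader):
        self._loader = loader
        self._cache = None
        self.disk_reads = 0  # counts how often the expensive load actually ran

    def records(self):
        if self._cache is None:
            self._cache = list(self._loader())
            self.disk_reads += 1
        return self._cache


def slow_disk_loader():
    # Stand-in for an expensive disk scan (assumption for this sketch).
    return range(1, 6)


dataset = InMemoryDataset(slow_disk_loader)

# Two independent tasks reuse the same memory-resident data.
total = sum(dataset.records())
largest = max(dataset.records())

print(total, largest, dataset.disk_reads)
```

Both tasks run against the cached copy, so the loader is invoked only once no matter how many tasks follow.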
64. Key Approaches
Lineage Based Fault Recovery
Minimize overhead of data versioning
Increase the re-usability of memory data
Data Co-partitioning
Co-partition two tables that are frequently used together on a
common key
'DISTRIBUTE BY' clause in the table creation statement (HDFS)
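The benefit of co-partitioning can be illustrated with a small sketch: when both tables are hash-partitioned on the join key with the same function, matching rows always land in the same partition, so each partition can be joined locally without moving data between nodes. The code below is a single-process illustration under that assumption; the function names and the sample tables are hypothetical.

```python
from collections import defaultdict


def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition based on the hash of its key column."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key]) % num_partitions].append(row)
    return parts


def copartitioned_join(left, right, key, num_partitions=4):
    """Join two tables partition by partition; no cross-partition lookups are needed."""
    left_parts = hash_partition(left, key, num_partitions)
    right_parts = hash_partition(right, key, num_partitions)
    joined = []
    for p in range(num_partitions):
        # Build a local index for this partition of the right table only.
        index = defaultdict(list)
        for row in right_parts.get(p, []):
            index[row[key]].append(row)
        for lrow in left_parts.get(p, []):
            for rrow in index.get(lrow[key], []):
                joined.append({**lrow, **rrow})
    return joined


users = [{"uid": 1, "name": "a"}, {"uid": 2, "name": "b"}]
orders = [{"uid": 1, "item": "x"}, {"uid": 1, "item": "y"}, {"uid": 3, "item": "z"}]
result = copartitioned_join(users, orders, "uid")
print(sorted(r["item"] for r in result))
```

In a distributed setting each loop iteration over `p` would run on the node that already holds both partitions, which is the point of declaring the partitioning key when the tables are created.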
68. Key Approaches
Significant practical value for developers and analysts
Simple to debug a running computation
Hybrid Storage Architecture
Uses a combination of flash storage and traditional hard disk drives
Data used most often resides on the faster, high-performance flash
drives and,
Data that just needs to be stored until needed resides on traditional
hard disks
Optimized throughput and power efficient
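A hybrid storage layer needs a placement policy that decides which data sits on flash and which on hard disk. The sketch below shows one simple possibility, promotion by access count; it is an assumption made for illustration and not the policy of any system cited in these slides. The class `HybridStore` and its two dictionaries standing in for the tiers are hypothetical.

```python
from collections import Counter


class HybridStore:
    """Two-tier key-value store: a small 'flash' tier for hot keys, a 'disk' tier for the rest."""

    def __init__(self, flash_capacity):
        self.flash_capacity = flash_capacity
        self.flash = {}        # fast, small tier
        self.disk = {}         # slow, large tier
        self.hits = Counter()  # access count per key

    def put(self, key, value):
        # New data lands on disk; keys already on flash are updated in place.
        if key in self.flash:
            self.flash[key] = value
        else:
            self.disk[key] = value

    def get(self, key):
        self.hits[key] += 1
        if key in self.flash:
            return self.flash[key]
        value = self.disk[key]
        self._maybe_promote(key)
        return value

    def _maybe_promote(self, key):
        if len(self.flash) < self.flash_capacity:
            self.flash[key] = self.disk.pop(key)
            return
        # Flash is full: swap with the least-accessed flash key if this key is hotter.
        coldest = min(self.flash, key=lambda k: self.hits[k])
        if self.hits[key] > self.hits[coldest]:
            self.disk[coldest] = self.flash.pop(coldest)
            self.flash[key] = self.disk.pop(key)


store = HybridStore(flash_capacity=1)
store.put("a", 1)
store.put("b", 2)
store.get("a")              # "a" is promoted into the empty flash tier
store.get("b")
store.get("b")              # "b" is now hotter than "a" and replaces it on flash
print(sorted(store.flash), sorted(store.disk))
```

Real systems weigh more than access counts (recency, object size, write endurance of the flash devices), but the structure of the decision is the same: frequently used data migrates to the fast tier and the rest stays on the cheaper one.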
73. Open Challenges
Distributed Mining
Time evolving Data
Compression
Correctness Debugging
Performance Debugging
Memory Hierarchy
76. References I
[1] Apache. Apache Incubator. In: Apache Software Foundation ().
url: http://incubator.apache.org/.
[2] Apache Mahout - scalable machine learning algorithm. url:
http://lucene.apache.org/mahout/.
[3] Anurag Awasthi et al. Hybrid HBase: Leveraging Flash SSDs to
Improve Cost per Throughput of HBase. In: COMAD. 2012,
pp. 68-79.
[4] Len Bass, Paul Clements, and Rick Kazman. Software Architecture
in Practice. 2nd ed. Boston, MA, USA: Addison-Wesley Longman
Publishing Co., Inc., 2003. isbn: 0321154959.
[5] Albert Bifet et al. MOA: Massive Online Analysis. In: J. Mach.
Learn. Res. 11 (Aug. 2010), pp. 1601-1604. issn: 1532-4435. url:
http://dl.acm.org/citation.cfm?id=1756006.1859903.
77. References II
[6] Conan. Linux kernel. In: English Wikipedia (). url:
http://en.wikipedia.org/wiki/File:Linux_kernel_map.png.
[7] Thomas H. Davenport. The Rise of Analytics 3.0 - How to
Compete in the Data Economy. In: International Institute for
Analytics (2013).
[8] Ewa Deelman et al. Pegasus: A Framework for Mapping Complex
Scientific Workflows Onto Distributed Systems. In: Sci. Program.
13.3 (July 2005), pp. 219-237. issn: 1058-9244. url:
http://dl.acm.org/citation.cfm?id=1239649.1239653.
[9] Intel Embedded Developers and Engineers. What Happens In An
Internet Minute? In: Intel Communications (2011).
79. References III
[10] Laszlo Dobos et al. Graywulf: A Platform for Federated Scientific
Databases and Services. In: Proceedings of the 25th International
Conference on Scientific and Statistical Database Management.
SSDBM. Baltimore, Maryland: ACM, 2013, 30:1-30:12. isbn:
978-1-4503-1921-8. doi: 10.1145/2484838.2484863. url:
http://doi.acm.org/10.1145/2484838.2484863.
[11] Amol Ghoting et al. SystemML: Declarative Machine Learning on
MapReduce. In: Proceedings of the 2011 IEEE 27th International
Conference on Data Engineering. ICDE '11. Washington, DC, USA:
IEEE Computer Society, 2011, pp. 231-242. isbn:
978-1-4244-8959-6. doi: 10.1109/ICDE.2011.5767930. url:
http://dx.doi.org/10.1109/ICDE.2011.5767930.
[12] Seth Gilbert and Nancy Lynch. Brewer's Conjecture and the
Feasibility of Consistent Available Partition-Tolerant Web Services.
In: ACM SIGACT News. 2002.
82. References IV
[13] Mark Hall et al. The WEKA Data Mining Software: An Update.
In: SIGKDD Explor. Newsl. 11.1 (Nov. 2009), pp. 10-18. issn:
1931-0145. doi: 10.1145/1656274.1656278. url:
http://doi.acm.org/10.1145/1656274.1656278.
[14] Michael Isard et al. Dryad: Distributed Data-parallel Programs
from Sequential Building Blocks. In: Proceedings of the 2nd ACM
SIGOPS/EuroSys European Conference on Computer Systems 2007.
EuroSys '07. Lisbon, Portugal: ACM, 2007, pp. 59-72. isbn:
978-1-59593-636-3. doi: 10.1145/1272996.1273005. url:
http://doi.acm.org/10.1145/1272996.1273005.
[15] Tim Kraska et al. MLbase: A Distributed Machine-learning
System. In: CIDR. 2013.
[16] Doug Laney. 3D Data Management: Controlling Data Volume,
Velocity and Variety. In: Research Report, META Group (now
Gartner) (2001).
83. References V
[17] John Langford, Lihong Li, and Alex Strehl. Vowpal Wabbit. 2007.
[18] Todd Lindeman and Brian Vastag. Rise of the digital information
age. In: (2007). url: http://www.washingtonpost.com/.
[19] Yucheng Low et al. Distributed GraphLab: A Framework for
Machine Learning and Data Mining in the Cloud. In: Proc. VLDB
Endow. 5.8 (Apr. 2012), pp. 716-727. issn: 2150-8097. url:
http://dl.acm.org/citation.cfm?id=2212351.2212354.
[20] Antonio Lupher. Shark: SQL and Analytics with Cost-Based Query
Optimization on Coarse-Grained Distributed Memory. MA thesis.
EECS Department, University of California, Berkeley, 2014. url:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-1.html.
84. References VI
[21] Grzegorz Malewicz et al. Pregel: A System for Large-scale Graph
Processing. In: Proceedings of the 2010 ACM SIGMOD
International Conference on Management of Data. SIGMOD '10.
Indianapolis, Indiana, USA: ACM, 2010, pp. 135-146. isbn:
978-1-4503-0032-2. doi: 10.1145/1807167.1807184. url:
http://doi.acm.org/10.1145/1807167.1807184.
[22] Nathan Marz. Big data : principles and best practices of scalable
realtime data systems. [S.l.]: Manning Publications, 2013. isbn:
1617290343 9781617290343. url:
http://www.amazon.de/Big-Data-Principles-Practices-Scalable/dp/1617290343.
[23] JIA Mom. Escape from Medical Bill Mountain. In: JIA Mom's
Blog (2013).
[24] R Development Core Team. R: A Language and Environment for
Statistical Computing. ISBN 3-900051-07-0. R Foundation for
Statistical Computing. Vienna, Austria, 2008. url:
http://www.R-project.org.
85. References VII
[26] Bikas Saha. Apache Tez: A New Chapter in Hadoop Data
Processing. In: Technical blog, Hortonworks (2013).
[27] Vinod Kumar Vavilapalli et al. Apache Hadoop YARN: Yet Another
Resource Negotiator. In: Proceedings of the 4th Annual
Symposium on Cloud Computing. SOCC '13. Santa Clara,
California: ACM, 2013, 5:1-5:16. isbn: 978-1-4503-2428-1. doi:
10.1145/2523616.2523633. url:
http://doi.acm.org/10.1145/2523616.2523633.
86. References VIII
[28] Matei Zaharia et al. Spark: Cluster Computing with Working Sets.
In: Proceedings of the 2nd USENIX Conference on Hot Topics in
Cloud Computing. HotCloud'10. Boston, MA: USENIX Association,
2010, pp. 10-10. url:
http://dl.acm.org/citation.cfm?id=1863103.1863113.