Big Data Analytics: Architectural Perspective
Sumit Kalra 
Advisor: Prof. TV Prabhakar 
Ph.D. State of The Art Seminar 
Department of Computer Science and Engineering 
Indian Institute of Technology, Kanpur 
Outline 
1 Motivation 
2 Software Architecture and Big Data Analytics 
3 Software Architecture in Big Data Analytics 
4 State of The Art Frameworks/Systems 
5 Key Approaches 
6 Open Challenges 
7 References 
8 Appendix 
Data Explosion 
Figure: Exponential growth of digital data [1]
[1] Todd Lindeman and Brian Vastag. "Rise of the digital information age". In: The Washington Post (2007). url: http://www.washingtonpost.com/.
One Internet Minute 
Figure: Data transactions in one minute on the Internet [2]
[2] Intel Embedded Developers and Engineers. "What Happens In An Internet Minute?" In: Intel Communications (2011).
Mining Data Mountain 
Figure: Mining information from raw data [3]
[3] JIA Mom. "Escape from Medical Bill Mountain." In: JIA Mom's Blog (2013).
Outline for Section 2 
1 Motivation 
2 Software Architecture and Big Data Analytics 
Software Architecture Preliminaries 
Data Analytics Preliminaries 
'Big' Data 
Architectural Perspective 
Big Data Eco-System 
Parallel and Distributed Systems 
3 Software Architecture in Big Data Analytics 
4 State of The Art Frameworks/Systems 
5 Key Approaches 
6 Open Challenges 
Definition: Software Architecture
Software Architecture 
The software architecture of a program or computing system is the 
structure or structures of the system, which comprise software elements, 
the externally visible properties of those elements, and the relationships 
among them. [a]
[a] Len Bass, Paul Clements, and Rick Kazman. Software Architecture in Practice. 2nd ed. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2003. isbn: 0321154959.
Example: Software Architecture 
Figure: Linux kernel map [4]
[4] Conan. "Linux kernel". In: English Wikipedia. url: http://en.wikipedia.org/wiki/File:Linux_kernel_map.png.
Data Analytics 1 2 3 
Figure: Analytics 1.0, 2.0 and 3.0 [5]
[5] Thomas H. Davenport. "The Rise of Analytics 3.0 - How to Compete in the Data Economy". In: International Institute for Analytics (2013).
'Big' Data 
Data doesn't fit into memory.
Large and complex data sets that are difficult to process with existing data management tools.
3 Vs of Big Data
According to Gartner's research report [a], the 3 attributes of big data are:
Volume
Velocity
Variety
[a] Doug Laney. "3D Data Management: Controlling Data Volume, Velocity and Variety". In: Research Report, META Group (now Gartner) (2001).
Architecture and Data 
Data Oriented System Architecture 1/2 
Scalability 
Master-slave architecture 
Peer-to-peer deployment 
Cluster deployment 
At the cost of management overhead 
Performance 
Over the last decade: 
Sequential I/O throughput of commodity SATA drives has increased from 20 MB/sec to 60-80 MB/sec
Disk size has increased by a factor of 1,000.
Reading a terabyte of data at this rate takes roughly 3.5 to 4.6 hours.
To increase throughput, use more, smaller disks and read from them in parallel
At the cost of increased power consumption 
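The read-time figure above can be checked with a little arithmetic. The sketch below assumes decimal units (1 TB = 1,000,000 MB) and data striped evenly across disks; `read_hours` is a hypothetical helper name, not something from the slides.

```python
# Back-of-the-envelope scan time for 1 TB of data.
# Assumption: decimal units, i.e. 1 TB = 1,000,000 MB.
TERABYTE_MB = 1_000_000


def read_hours(throughput_mb_per_sec: float, disks: int = 1) -> float:
    """Hours needed to scan 1 TB striped evenly across `disks` drives."""
    seconds = TERABYTE_MB / (throughput_mb_per_sec * disks)
    return seconds / 3600


for rate in (20, 60, 80):
    print(f"{rate} MB/sec, 1 disk: {read_hours(rate):.2f} h")

# Reading from ten disks in parallel divides the scan time by ten.
print(f"60 MB/sec, 10 disks: {read_hours(60, disks=10):.2f} h")
```

Under these assumptions the scan time falls linearly with the number of disks read in parallel, which is the trade-off the slide describes: more throughput, paid for in power.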
Data Oriented System Architecture 2/2 
Figure: CAP Theorem [6]
[6] Seth Gilbert and Nancy Lynch. "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services". In: ACM SIGACT News (2002).
Big Data Eco-System 
Figure: Big Data Eco-System Components 
Inherent Issues in Big Data Systems 
Data is not centralized, memory-resident, or static
Bandwidth is limited, and data is generated too fast
In a distributed system, each site has only a limited view of the entire system
Data is a mix of unstructured, semi-structured, and structured
Categorization of Parallel and Distributed Systems
Shared-Memory systems
Distributed-Memory systems
Hierarchical systems
Paradigms for Parallelism
Data Parallelism: Partitioned database
Task Parallelism: Replicated database
Hybrid Parallelism: Large-scale database with a huge search space
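The difference between the first two paradigms can be sketched in a few lines. This is a minimal illustration using Python's standard `concurrent.futures`; the function names and the toy data are assumptions. Threads are used only to show the structure of each paradigm, not to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n):
    """Split rows into n disjoint partitions (round-robin)."""
    return [rows[i::n] for i in range(n)]


def data_parallel_sum(rows, workers=4):
    """Data parallelism: the same task runs on every partition of the data."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, partition(rows, workers)))
    return sum(partial_sums)


def task_parallel_stats(rows):
    """Task parallelism: different tasks run on the same (replicated) data."""
    tasks = {"min": min, "max": max, "sum": sum}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(fn, rows) for name, fn in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


rows = list(range(1, 101))
total = data_parallel_sum(rows)
stats = task_parallel_stats(rows)
print(total, stats)
```

A hybrid scheme would combine the two: partition the data, then run several different tasks over each partition.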
Outline for Section 3 
1 Motivation 
2 Software Architecture and Big Data Analytics 
3 Software Architecture in Big Data Analytics 
In-Memory Data Abstraction 
Execution Plan Reconfiguration at Runtime
4 State of The Art Frameworks/Systems 
5 Key Approaches 
6 Open Challenges 
7 References 
In-Memory Data Abstraction 1/2 
Figure: Apache Spark Streaming Architecture [7]
[7] Matei Zaharia et al. "Spark: Cluster Computing with Working Sets". In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. HotCloud'10. Boston, MA: USENIX Association, 2010, pp. 10-10. url: http://dl.acm.org/citation.cfm?id=1863103.1863113.
In-Memory Data Abstraction 2/2 
Figure: Apache Spark Performance Comparison [8]
[8] Matei Zaharia et al. "Spark: Cluster Computing with Working Sets". In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. HotCloud'10. Boston, MA: USENIX Association, 2010, pp. 10-10. url: http://dl.acm.org/citation.cfm?id=1863103.1863113.
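The idea behind an in-memory data abstraction can be illustrated with a toy, single-machine sketch. It is plain Python, not the Spark API: the `Dataset` class and its methods are hypothetical names chosen for illustration. Each dataset keeps its result in memory after first use and remembers how it was derived, so a lost copy can be rebuilt instead of being restored from a replica.

```python
class Dataset:
    """Toy in-memory dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source    # callable that (re)loads the base data
        self._parent = parent    # dataset this one was derived from
        self._fn = fn            # transformation applied to the parent's rows
        self._cache = None       # in-memory copy, filled on first use
        self.computations = 0    # how many times the rows were (re)built

    def map(self, fn):
        return Dataset(parent=self, fn=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return Dataset(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        if self._cache is None:  # cache miss: rebuild from lineage
            if self._parent is None:
                rows = self._source()
            else:
                rows = self._fn(self._parent.collect())
            self._cache = list(rows)
            self.computations += 1
        return self._cache

    def evict(self):
        """Simulate losing the in-memory copy, e.g. after a node failure."""
        self._cache = None


base = Dataset(source=lambda: range(10))
evens = base.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

first = squares.collect()      # computed once, then kept in memory
second = squares.collect()     # served from memory, no recomputation
evens.evict()
squares.evict()
recovered = squares.collect()  # rebuilt from lineage; base is still cached
```

Repeated `collect()` calls reuse the cached rows, which is where the savings in disk I/O come from when several tasks share the same working set.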
Execution Plan Reconfiguration at Runtime 1/2
Figure: Apache Tez Plan Reconfiguration at Runtime [9]
[9] Bikas Saha. "Apache Tez: A New Chapter in Hadoop Data Processing". In: Technical blog, Hortonworks (2013).
Execution Plan Reconfiguration at Runtime 2/2
Figure: Apache Tez Performance Comparison [10]
[10] Bikas Saha. "Apache Tez: A New Chapter in Hadoop Data Processing". In: Technical blog, Hortonworks (2013).
Outline for Section 4 
1 Motivation 
2 Software Architecture and Big Data Analytics 
3 Software Architecture in Big Data Analytics 
4 State of The Art Frameworks/Systems 
Timeline 
Categorization 
Processing Time Based Categorization 
Programming Paradigm Based Categorization 
Service Model Based Categorization 
Storage Architecture Based Categorization 
5 Key Approaches 
6 Open Challenges 
Big Data Analytics Tool Timeline 
Figure: Timeline 
Big Data Analytics Tool Categorization 
Figure: CMAP - Tool Categorization 
Processing Time 
Figure: Big Data Analytics: Processing Time 
Programming Model 
Figure: Big Data Analytics: Programming Paradigm 
Service Model 
Figure: Big Data Analytics: Service Model 
Storage Architecture 
Figure: Big Data Analytics: Storage Architecture 
Key Approaches 
In-Memory Data Abstraction 
Minimize Disk I/O 
Share data residing in memory between multiple tasks 
Partial DAG Execution 
Dynamic execution flow configuration
Dynamic alteration of query plans based on data statistics collected 
at run time. 
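Altering a plan from run-time statistics can be sketched as follows. This is a simplified, single-process illustration, not how any particular engine implements it: the `BROADCAST_LIMIT` threshold, the statistics collected, and all function names are assumptions.

```python
BROADCAST_LIMIT = 1_000  # hypothetical threshold, in rows


def run_stage(rows):
    """Run an upstream stage and collect simple statistics as a side effect."""
    rows = list(rows)
    return rows, {"row_count": len(rows)}


def choose_join(left_stats, right_stats, limit=BROADCAST_LIMIT):
    """Pick the join strategy only after the upstream statistics are known."""
    if min(left_stats["row_count"], right_stats["row_count"]) <= limit:
        return "broadcast"  # ship the small side to every worker
    return "shuffle"        # repartition both sides on the join key


def join(left, right, key):
    left, left_stats = run_stage(left)
    right, right_stats = run_stage(right)
    strategy = choose_join(left_stats, right_stats)
    # Both strategies produce the same rows; on a cluster they differ
    # only in how much data has to move between nodes.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = [{**l, **r} for l in left for r in index.get(l[key], [])]
    return strategy, joined


orders = [{"user": u % 3, "amount": u} for u in range(6)]
users = [{"user": 0, "name": "a"}, {"user": 1, "name": "b"}]
strategy, rows = join(orders, users, "user")
```

The point of the sketch is the ordering: the downstream operator is chosen after the upstream stages have run, so the decision uses observed sizes rather than estimates made before execution.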
Key Approaches 
Lineage Based Fault Recovery 
Minimize overhead of data versioning 
Increase the reusability of in-memory data
Data Co-partitioning
Co-partition two tables that are frequently used together on a common key
'DISTRIBUTE BY' clause in the table creation statement (HDFS) 
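Why co-partitioning helps can be seen in a small sketch. When both tables are hash-partitioned on the join key with the same function, rows with equal keys land in the same partition, so each partition can be joined locally with no data exchanged between partitions. The helper names and toy tables below are assumptions for illustration.

```python
from collections import defaultdict


def hash_partition(rows, key, n):
    """Assign each row to one of n partitions by hashing its key value."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts


def co_partitioned_join(left, right, key, n=4):
    """Join two tables partition by partition; equal keys share a partition."""
    left_parts = hash_partition(left, key, n)
    right_parts = hash_partition(right, key, n)
    out = []
    for p in range(n):  # each partition could be processed on a separate node
        lookup = defaultdict(list)
        for r in right_parts.get(p, []):
            lookup[r[key]].append(r)
        for l in left_parts.get(p, []):
            out.extend({**l, **r} for r in lookup.get(l[key], []))
    return out


users = [{"uid": i, "name": f"u{i}"} for i in range(5)]
orders = [{"uid": i % 5, "amount": i} for i in range(12)]
joined = co_partitioned_join(orders, users, "uid")
```

In a real system the partitioning is done once, when the tables are created, so the cost is paid up front and every later join on that key avoids a shuffle.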
Key Approaches 
Uni
cation of Streaming, Batch and Interactive Processing 
Signi
cant practical value for developers and analysts 
Simple to debug a running computation 
Hybrid Storage Architecture 
Uses a combination of 
ash storage and traditional hard disk drives 
Data used most often resides on the faster, high performance 
ash 
drives and, 
Data that just needs to be stored until needed resides on traditional 
hard disks 
Optimized throughput and power ecient 
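The placement rule for hybrid storage can be sketched as a simple frequency-based tiering policy. This is a minimal illustration only: `place_blocks`, the access log, and the capacity measured in blocks are all hypothetical, and real systems also weigh recency, block size, and migration cost.

```python
from collections import Counter


def place_blocks(access_log, flash_capacity):
    """Put the most frequently read blocks on flash, the rest on hard disk."""
    counts = Counter(access_log)
    ranked = [block for block, _ in counts.most_common()]
    return {
        "flash": set(ranked[:flash_capacity]),  # hot data: fast tier
        "hdd": set(ranked[flash_capacity:]),    # cold data: capacity tier
    }


log = ["a", "b", "a", "c", "a", "b", "d"]
tiers = place_blocks(log, flash_capacity=2)
print(tiers)
```

Here blocks `a` and `b` are read most often and go to flash, while `c` and `d` stay on hard disk.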

Concepts, use cases and principles to build big data systems (1)Trieu Nguyen
 
Architecting Cloudy Applications
Architecting Cloudy ApplicationsArchitecting Cloudy Applications
Architecting Cloudy ApplicationsDavid Chou
 
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentHostedbyConfluent
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big DataFrank Kienle
 
TidalScale Overview
TidalScale OverviewTidalScale Overview
TidalScale OverviewPete Jarvis
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataRobert Grossman
 
Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...
Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...
Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...Nati Shalom
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...confluent
 
IRJET- Systematic Review: Progression Study on BIG DATA articles
IRJET- Systematic Review: Progression Study on BIG DATA articlesIRJET- Systematic Review: Progression Study on BIG DATA articles
IRJET- Systematic Review: Progression Study on BIG DATA articlesIRJET Journal
 
How do data analysts work with big data and distributed computing frameworks.pdf
How do data analysts work with big data and distributed computing frameworks.pdfHow do data analysts work with big data and distributed computing frameworks.pdf
How do data analysts work with big data and distributed computing frameworks.pdfSoumodeep Nanee Kundu
 
Qo Introduction V2
Qo Introduction V2Qo Introduction V2
Qo Introduction V2Joe_F
 
Best storage engine for MySQL
Best storage engine for MySQLBest storage engine for MySQL
Best storage engine for MySQLtomflemingh2
 
An Efficient Approach for Clustering High Dimensional Data
An Efficient Approach for Clustering High Dimensional DataAn Efficient Approach for Clustering High Dimensional Data
An Efficient Approach for Clustering High Dimensional DataIJSTA
 
using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviationranjit banshpal
 

Ähnlich wie Big Data Analytics: Architectural Perspective (20)

Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
Running a Megasite on Microsoft Technologies
Running a Megasite on Microsoft TechnologiesRunning a Megasite on Microsoft Technologies
Running a Megasite on Microsoft Technologies
 
An Efficient Approach to Manage Small Files in Distributed File Systems
An Efficient Approach to Manage Small Files in Distributed File SystemsAn Efficient Approach to Manage Small Files in Distributed File Systems
An Efficient Approach to Manage Small Files in Distributed File Systems
 
詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systems詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systems
 
詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systems詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systems
 
Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)
 
Architecting Cloudy Applications
Architecting Cloudy ApplicationsArchitecting Cloudy Applications
Architecting Cloudy Applications
 
Big data analysis concepts and references
Big data analysis concepts and referencesBig data analysis concepts and references
Big data analysis concepts and references
 
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big Data
 
TidalScale Overview
TidalScale OverviewTidalScale Overview
TidalScale Overview
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big Data
 
Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...
Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...
Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
 
IRJET- Systematic Review: Progression Study on BIG DATA articles
IRJET- Systematic Review: Progression Study on BIG DATA articlesIRJET- Systematic Review: Progression Study on BIG DATA articles
IRJET- Systematic Review: Progression Study on BIG DATA articles
 
How do data analysts work with big data and distributed computing frameworks.pdf
How do data analysts work with big data and distributed computing frameworks.pdfHow do data analysts work with big data and distributed computing frameworks.pdf
How do data analysts work with big data and distributed computing frameworks.pdf
 
Qo Introduction V2
Qo Introduction V2Qo Introduction V2
Qo Introduction V2
 
Best storage engine for MySQL
Best storage engine for MySQLBest storage engine for MySQL
Best storage engine for MySQL
 
An Efficient Approach for Clustering High Dimensional Data
An Efficient Approach for Clustering High Dimensional DataAn Efficient Approach for Clustering High Dimensional Data
An Efficient Approach for Clustering High Dimensional Data
 
using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviation
 

Kürzlich hochgeladen

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 

Kürzlich hochgeladen (20)

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

Big Data Analytics: Architectural Perspective

  • 1. Big Data Analytics : Architectural Perspective Sumit Kalra Advisor: Prof. TV Prabhakar Ph.D. State of The Art Seminar Department of Computer Science and Engineering Indian Institute of Technology, Kanpur 1 / 72
  • 2. Outline 1 Motivation 2 Software Architecture and Big Data Analytics 3 Software Architecture in Big Data Analytics 4 State of The Art Frameworks/Systems 5 Key Approaches 6 Open Challenges 7 References 8 Appendix 2 / 72
  • 4. Data Explosion Figure: Exponential growth of digital data [1] [1] Todd Lindeman and Brian Vastag. "Rise of the digital information age". In: (2007). url: http://www.washingtonpost.com/. 4 / 72
  • 5. One Internet Minute Figure: Data transaction in one minute on Internet [2] [2] Intel Embedded Developers and Engineers. "What Happens In An Internet Minute?" In: Intel Communications (2011). 5 / 72
  • 6. Mining Data Mountain Figure: Mining information from raw data [3] [3] JIA Mom. "Escape from Medical Bill Mountain." In: JIA Mom's Blog (2013). 6 / 72
  • 8. Outline for Section 2 1 Motivation 2 Software Architecture and Big Data Analytics Software Architecture Preliminaries Data Analytics Preliminaries 'Big' Data Architectural Perspective Big Data Eco-System Parallel and Distributed Systems 3 Software Architecture in Big Data Analytics 4 State of The Art Frameworks/Systems 5 Key Approaches 6 Open Challenges 8 / 72
  • 9. Definition: Software Architecture. Software Architecture: The software architecture of a program or computing system is the structure or structures of the system, which comprise software elements, the externally visible properties of those elements, and the relationships among them [a]. [a] Len Bass, Paul Clements, and Rick Kazman. Software Architecture in Practice. 2nd ed. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2003. isbn: 0321154959. 9 / 72
  • 11. Example: Software Architecture [4] [4] Conan. "Linux kernel". In: English Wikipedia. url: http://en.wikipedia.org/wiki/File:Linux_kernel_map.png. 10 / 72
  • 13. Data Analytics Figure: Analytics 1.0, 2.0 and 3.0 [5] [5] Thomas H. Davenport. "The Rise of Analytics 3.0 - How to Compete in the Data Economy". In: International Institute for Analytics (2013). 12 / 72
  • 15. 'Big' Data: Data doesn't fit into memory. Large and complex data sets which are difficult to process with existing data management tools. 3 Vs of Big Data: According to Gartner's research report [a], the 3 attributes of big data are Volume, Velocity, and Variety. [a] Doug Laney. 3D Data Management: Controlling Data Volume, Velocity and Variety. In: Research Report, META Group (now Gartner) (2001). 14 / 72
  • 19. Data Oriented System Architecture 1/2. Scalability: master-slave architecture, peer-to-peer deployment, or cluster deployment, at the cost of management overhead. Performance: over the last decade, sequential I/O throughput of commodity SATA drives has increased from 20 MB/sec to 60-80 MB/sec, while disk size has increased by a factor of 1,000. Reading a terabyte of data at this rate requires about 4.5 hours. To increase throughput, use more, smaller disks and read from them in parallel, at the cost of increased power consumption. 17 / 72
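The read-time figure on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only and rests on assumptions not stated in the slides: decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), perfectly even striping across disks, and no seek, controller, or network overhead. The function name `read_time_hours` and the 100-disk example are hypothetical.

```python
def read_time_hours(data_bytes, throughput_mb_per_s, disks=1):
    """Hours needed to scan data_bytes sequentially.

    Assumes the data is striped evenly over `disks` drives that are
    all read in parallel at the same sustained throughput.
    """
    bytes_per_second = throughput_mb_per_s * 1_000_000 * disks
    return data_bytes / bytes_per_second / 3600


ONE_TB = 10**12  # decimal terabyte (assumption)

# One drive at each end of the 60-80 MB/sec range quoted on the slide.
single_slow = read_time_hours(ONE_TB, 60)
single_fast = read_time_hours(ONE_TB, 80)

# Hypothetical: the same terabyte spread over 100 smaller drives.
parallel = read_time_hours(ONE_TB, 60, disks=100)

print(f"1 disk  @ 60 MB/s: {single_slow:.2f} h")
print(f"1 disk  @ 80 MB/s: {single_fast:.2f} h")
print(f"100 disks @ 60 MB/s: {parallel * 60:.2f} min")
```

Under these assumptions a single drive needs roughly 3.5 to 4.6 hours, which is consistent with the 4.5 hours quoted on the slide for the lower end of the throughput range, while 100 drives read in parallel bring the scan down to a few minutes.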
  • 26. Data Oriented System Architecture 2/2 Figure: CAP Theorem [6] [6] Seth Gilbert and Nancy Lynch. Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. In: ACM SIGACT News. 2002. 24 / 72
  • 28. Big Data Eco-System Figure: Big Data Eco-System Components 26 / 72
  • 30. Inherent Issues in Big Data Systems: Data is not centralized, memory-resident, and static. Bandwidth is limited and data is generated too fast. In a distributed system, each site has a limited view of the entire system. Data may be unstructured, semi-structured, or structured. 28 / 72
  • 32. Categorization of Parallel and Distributed Systems: Shared-Memory systems, Distributed-Memory systems, Hierarchical systems. Paradigms for Parallelism: Data Parallelism (partitioned database); Task Parallelism (replicated database); Hybrid Parallelism (large-scale database with huge search space). 30 / 72
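The difference between the first two paradigms on this slide can be sketched in a few lines. This is a minimal illustration, not taken from the slides: the toy `records` list stands in for a database, the helper names (`partition`, `data_parallel_sum`, `task_parallel_stats`) are hypothetical, and threads are used only to keep the example short. In CPython, threads will not speed up CPU-bound work, so the sketch shows the structure of each paradigm rather than a performance gain.

```python
from concurrent.futures import ThreadPoolExecutor

records = list(range(1, 101))  # toy "database" of 100 numeric records


def partition(data, n):
    """Split data into at most n contiguous, roughly equal partitions."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]


def data_parallel_sum(data, workers=4):
    """Data parallelism: the same operation runs on different partitions."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, partition(data, workers)))
    # Combine the per-partition results into the final answer.
    return sum(partial_sums)


def task_parallel_stats(data):
    """Task parallelism: different operations run on full replicas."""
    tasks = {"min": min, "max": max, "sum": sum}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        # list(data) gives each task its own copy, mimicking a replica.
        futures = {name: pool.submit(fn, list(data))
                   for name, fn in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


print(data_parallel_sum(records))
print(task_parallel_stats(records))
```

A hybrid scheme would combine the two, for example by running several different tasks where each task is itself executed over partitions of the data.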
  • 35. Outline for Section 3 1 Motivation 2 Software Architecture and Big Data Analytics 3 Software Architecture in Big Data Analytics In-Memory Data Abstraction Execution Plan Reconfiguration at Runtime 4 State of The Art Frameworks/Systems 5 Key Approaches 6 Open Challenges 7 References 33 / 72
  • 37. In-Memory Data Abstraction 1/2 Figure: Apache Spark Streaming Architecture [7] [7] Matei Zaharia et al. Spark: Cluster Computing with Working Sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. HotCloud'10. Boston, MA: USENIX Association, 2010, pp. 10-10. url: http://dl.acm.org/citation.cfm?id=1863103.1863113. 34 / 72
  • 38. In-Memory Data Abstraction 2/2 Figure: Apache Spark Performance Comparison [8] [8] Matei Zaharia et al. Spark: Cluster Computing with Working Sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. HotCloud'10. Boston, MA: USENIX Association, 2010, pp. 10-10. url: http://dl.acm.org/citation.cfm?id=1863103.1863113. 35 / 72
  • 41. Execution Plan Reconfiguration at Runtime 1/2 Figure: Apache Tez Plan Reconfiguration at Runtime [9] [9] Bikas Saha. Apache Tez: A New Chapter in Hadoop Data Processing. In: Technical blog, Hortonworks (2013). 37 / 72
  • 44. Execution Plan Reconfiguration at Runtime 2/2 Figure: Apache Tez Performance Comparison [10] [10] Bikas Saha. Apache Tez: A New Chapter in Hadoop Data Processing. In: Technical blog, Hortonworks (2013). 38 / 72
  • 47. Outline for Section 4 1 Motivation 2 Software Architecture and Big Data Analytics 3 Software Architecture in Big Data Analytics 4 State of The Art Frameworks/Systems Timeline Categorization Processing Time Based Categorization Programming Paradigm Based Categorization Service Model Based Categorization Storage Architecture Based Categorization 5 Key Approaches 6 Open Challenges 40 / 72
  • 48. Big Data Analytics Tool Timeline. Figure: Timeline
  • 50. Big Data Analytics Tool Categorization. Figure: CMAP - Tool Categorization
  • 52. Processing Time. Figure: Big Data Analytics: Processing Time
  • 54. Programming Model. Figure: Big Data Analytics: Programming Paradigm
  • 56. Service Model. Figure: Big Data Analytics: Service Model
  • 58. Storage Architecture. Figure: Big Data Analytics: Storage Architecture
  • 60. Key Approaches. In-Memory Data Abstraction: minimize disk I/O; share data residing in memory between multiple tasks. Partial DAG Execution: dynamic execution flow configuration; dynamic alteration of query plans based on data statistics collected at run time.
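As a rough illustration of partial DAG execution, the sketch below postpones the choice of join strategy until the sizes of both inputs are known at run time. It is a single-process toy under stated assumptions, not the mechanism of any particular framework: the names `BROADCAST_ROW_LIMIT`, `choose_join_strategy` and `run_join`, and the threshold value, are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical threshold: an input with at most this many rows is treated
# as small enough to copy to every task (a "broadcast" join).
BROADCAST_ROW_LIMIT = 3


def choose_join_strategy(left_rows, right_rows, limit=BROADCAST_ROW_LIMIT):
    """Pick a join strategy from statistics observed after the upstream stages ran."""
    if min(len(left_rows), len(right_rows)) <= limit:
        return "broadcast"
    return "shuffle"


def run_join(left_rows, right_rows):
    """Join two lists of (key, value) pairs; the strategy is decided at run time.

    In this toy both strategies compute the same rows. In a cluster they
    would differ in how data moves between machines.
    """
    strategy = choose_join_strategy(left_rows, right_rows)

    # Build a hash index on the right input, then probe it with the left input.
    index = defaultdict(list)
    for key, value in right_rows:
        index[key].append(value)
    joined = [
        (key, left_value, right_value)
        for key, left_value in left_rows
        for right_value in index.get(key, [])
    ]
    return strategy, joined


if __name__ == "__main__":
    orders = [(1, "a"), (2, "b"), (1, "c"), (3, "d"), (2, "e")]
    users = [(1, "ann"), (2, "bob")]
    print(run_join(orders, users))
```

The point of the sketch is only the ordering: the statistics (here, row counts) are collected first, and the remaining part of the plan is fixed afterwards.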
  • 64. Key Approaches. Lineage-Based Fault Recovery: minimize the overhead of data versioning; increase the reusability of in-memory data. Data Co-partitioning: co-partition two tables that are frequently used together on a common key; 'DISTRIBUTE BY' clause in the table creation statement (HDFS).
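The idea behind lineage-based fault recovery can be sketched as follows: each derived dataset remembers its parent and the function that produced it, so a lost partition is recomputed instead of being restored from a stored copy. The `Dataset` class and its methods are hypothetical names for this illustration and do not correspond to any real library API.

```python
class Dataset:
    """Toy partitioned dataset that records how it was derived (its lineage)."""

    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions  # list of lists; None marks a lost partition
        self.parent = parent          # dataset this one was derived from
        self.fn = fn                  # element-wise function applied to the parent

    def map(self, fn):
        """Derive a new dataset and keep a reference to the parent and function."""
        child = [[fn(x) for x in part] for part in self.partitions]
        return Dataset(child, parent=self, fn=fn)

    def lose(self, i):
        """Simulate a failure that drops partition i from memory."""
        self.partitions[i] = None

    def get(self, i):
        """Return partition i, replaying the lineage if it was lost."""
        if self.partitions[i] is None:
            if self.parent is None:
                raise RuntimeError("source partition lost; no lineage to replay")
            # Recompute only the missing partition from the parent's partition.
            self.partitions[i] = [self.fn(x) for x in self.parent.get(i)]
        return self.partitions[i]


if __name__ == "__main__":
    base = Dataset([[1, 2], [3, 4]])
    squares = base.map(lambda x: x * x)
    shifted = squares.map(lambda x: x + 1)
    shifted.lose(1)
    squares.lose(1)
    print(shifted.get(1))  # recomputed through two lineage steps
```

Only the lost partition is recomputed, and no versioned copies of intermediate data are kept, which is the trade-off the slide points at.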
  • 67. Key Approaches. Unification of Streaming, Batch and Interactive Processing: significant practical value for developers and analysts; simple to debug a running computation. Hybrid Storage Architecture: uses a combination of flash storage and traditional hard disk drives; the most frequently used data resides on the faster, high-performance flash drives, while data that just needs to be stored until needed resides on traditional hard disks; optimized throughput and power efficiency.
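A minimal sketch of the hybrid storage idea, assuming a simple access-count policy: frequently read keys are promoted to a small "flash" tier and everything else stays on a "disk" tier. The `HybridStore` class, its method names and the promotion policy are assumptions for illustration only; real systems use their own placement policies.

```python
from collections import Counter


class HybridStore:
    """Toy two-tier key-value store: a small flash tier for hot keys, disk for the rest."""

    def __init__(self, flash_capacity):
        self.flash_capacity = flash_capacity  # maximum number of keys on flash
        self.flash = {}
        self.disk = {}
        self.hits = Counter()                 # read count per key

    def put(self, key, value):
        """Update a key in place if it is on flash; otherwise write it to disk."""
        if key in self.flash:
            self.flash[key] = value
        else:
            self.disk[key] = value

    def get(self, key):
        """Read a key from whichever tier holds it and record the access."""
        self.hits[key] += 1
        if key in self.flash:
            return self.flash[key]
        return self.disk[key]

    def rebalance(self):
        """Move the most frequently read keys to flash and the rest to disk."""
        everything = {**self.disk, **self.flash}
        by_heat = sorted(everything, key=lambda k: self.hits[k], reverse=True)
        hot = by_heat[: self.flash_capacity]
        self.flash = {k: everything[k] for k in hot}
        self.disk = {k: v for k, v in everything.items() if k not in self.flash}

    def tier_of(self, key):
        return "flash" if key in self.flash else "disk"


if __name__ == "__main__":
    store = HybridStore(flash_capacity=1)
    for name in ("a", "b", "c"):
        store.put(name, name.upper())
    for _ in range(3):
        store.get("b")
    store.get("a")
    store.rebalance()
    print({k: store.tier_of(k) for k in ("a", "b", "c")})
```

Rebalancing in batches keeps the read path simple; a production design would also have to weigh migration cost against the expected gain.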
  • 73. Open Challenges: Distributed Mining; Time-Evolving Data; Compression; Correctness Debugging; Performance Debugging; Memory Hierarchy
  • 76. References I
    [1] Apache. Apache Incubator. In: Apache Software Foundation. url: http://incubator.apache.org/.
    [2] Apache Mahout - scalable machine learning algorithm. url: http://lucene.apache.org/mahout/.
    [3] Anurag Awasthi et al. Hybrid HBase: Leveraging Flash SSDs to Improve Cost per Throughput of HBase. In: COMAD. 2012, pp. 68-79.
    [4] Len Bass, Paul Clements, and Rick Kazman. Software Architecture in Practice. 2nd ed. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2003. isbn: 0321154959.
    [5] Albert Bifet et al. MOA: Massive Online Analysis. In: J. Mach. Learn. Res. 11 (Aug. 2010), pp. 1601-1604. issn: 1532-4435. url: http://dl.acm.org/citation.cfm?id=1756006.1859903.
  • 77. References II
    [6] Conan. Linux kernel. In: English Wikipedia. url: http://en.wikipedia.org/wiki/File:Linux_kernel_map.png.
    [7] Thomas H. Davenport. The Rise of Analytics 3.0 - How to Compete in the Data Economy. In: International Institute for Analytics (2013).
    [8] Ewa Deelman et al. Pegasus: A Framework for Mapping Complex Scientific Workflows Onto Distributed Systems. In: Sci. Program. 13.3 (July 2005), pp. 219-237. issn: 1058-9244. url: http://dl.acm.org/citation.cfm?id=1239649.1239653.
    [9] Intel Embedded Developers and Engineers. What Happens In An Internet Minute? In: Intel Communications (2011).
  • 79. References III
    [10] Laszlo Dobos et al. Graywulf: A Platform for Federated Scientific Databases and Services. In: Proceedings of the 25th International Conference on Scientific and Statistical Database Management. SSDBM. Baltimore, Maryland: ACM, 2013, 30:1-30:12. isbn: 978-1-4503-1921-8. doi: 10.1145/2484838.2484863. url: http://doi.acm.org/10.1145/2484838.2484863.
    [11] Amol Ghoting et al. SystemML: Declarative Machine Learning on MapReduce. In: Proceedings of the 2011 IEEE 27th International Conference on Data Engineering. ICDE '11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 231-242. isbn: 978-1-4244-8959-6. doi: 10.1109/ICDE.2011.5767930. url: http://dx.doi.org/10.1109/ICDE.2011.5767930.
    [12] Seth Gilbert and Nancy Lynch. Brewer's Conjecture and the Feasibility of Consistent Available Partition-Tolerant Web Services. In: ACM SIGACT News (2002).
  • 82. References IV
    [13] Mark Hall et al. The WEKA Data Mining Software: An Update. In: SIGKDD Explor. Newsl. 11.1 (Nov. 2009), pp. 10-18. issn: 1931-0145. doi: 10.1145/1656274.1656278. url: http://doi.acm.org/10.1145/1656274.1656278.
    [14] Michael Isard et al. Dryad: Distributed Data-parallel Programs from Sequential Building Blocks. In: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007. EuroSys '07. Lisbon, Portugal: ACM, 2007, pp. 59-72. isbn: 978-1-59593-636-3. doi: 10.1145/1272996.1273005. url: http://doi.acm.org/10.1145/1272996.1273005.
    [15] Tim Kraska et al. MLbase: A Distributed Machine-learning System. In: CIDR. 2013.
    [16] Doug Laney. 3D Data Management: Controlling Data Volume, Velocity and Variety. In: Research Report, META Group (now Gartner) (2001).
  • 83. References V
    [17] John Langford, Lihong Li, and Alex Strehl. Vowpal Wabbit. 2007.
    [18] Todd Lindeman and Brian Vastag. Rise of the digital information age. In: (2007). url: http://www.washingtonpost.com/.
    [19] Yucheng Low et al. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. In: Proc. VLDB Endow. 5.8 (Apr. 2012), pp. 716-727. issn: 2150-8097. url: http://dl.acm.org/citation.cfm?id=2212351.2212354.
    [20] Antonio Lupher. Shark: SQL and Analytics with Cost-Based Query Optimization on Coarse-Grained Distributed Memory. MA thesis. EECS Department, University of California, Berkeley, 2014. url: http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-1.html.
  • 84. References VI
    [21] Grzegorz Malewicz et al. Pregel: A System for Large-scale Graph Processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. SIGMOD '10. Indianapolis, Indiana, USA: ACM, 2010, pp. 135-146. isbn: 978-1-4503-0032-2. doi: 10.1145/1807167.1807184. url: http://doi.acm.org/10.1145/1807167.1807184.
    [22] Nathan Marz. Big data: principles and best practices of scalable realtime data systems. [S.l.]: O'Reilly Media, 2013. isbn: 1617290343 9781617290343. url: http://www.amazon.de/Big-Data-Principles-Practices-Scalable/dp/1617290343.
    [23] JIA Mom. Escape from Medical Bill Mountain. In: JIA Mom's Blog (2013).
    [24] R Development Core Team. R: A Language and Environment for Statistical Computing. ISBN 3-900051-07-0. R Foundation for Statistical Computing. Vienna, Austria, 2008. url: http://www.R-project.org.
  • 85. References VII
    [25] R Development Core Team. R: A Language and Environment for Statistical Computing. ISBN 3-900051-07-0. R Foundation for Statistical Computing. Vienna, Austria, 2008. url: http://www.R-project.org.
    [26] Bikas Saha. Apache Tez: A New Chapter in Hadoop Data Processing. In: Technical blog, Hortonworks (2013).
    [27] Vinod Kumar Vavilapalli et al. Apache Hadoop YARN: Yet Another Resource Negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing. SOCC '13. Santa Clara, California: ACM, 2013, 5:1-5:16. isbn: 978-1-4503-2428-1. doi: 10.1145/2523616.2523633. url: http://doi.acm.org/10.1145/2523616.2523633.
  • 86. References VIII
    [28] Matei Zaharia et al. Spark: Cluster Computing with Working Sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. HotCloud '10. Boston, MA: USENIX Association, 2010, pp. 10-10. url: http://dl.acm.org/citation.cfm?id=1863103.1863113.
  • 88. Appendix 1: Big Data Analytics Tools Features