This document provides an introduction to a course on cloud computing and data intensive computing. The course objectives are to introduce key concepts and principles of cloud computing and data intensive computing, and to teach how to read, review, and present scientific papers. Topics covered in the course include cloud platforms, databases, resource management, and data processing frameworks. Students will complete reading assignments, exams, a presentation, and a final project.
Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/22 3 / 48
4. Course Objective
Introduction to main concepts and principles of cloud computing
and data intensive computing.
How to read, review and present a scientific paper.
5. Topics of Study
Topics we will cover include:
• Cloud platforms
• Cloud storage, NoSQL and NewSQL databases
• Cloud resource management
• Batch processing frameworks
• Stream processing frameworks
• Graph processing frameworks
6. Course Material
Mainly based on research papers.
You will find all the material on the course web page:
http://www.sics.se/~amir/cloud14.htm
7. Course Examination
Mid term exam: 20%
Final exam: 20%
Reading assignments: 27%
Final presentation: 23%
Final project: 10%
Lab assignments: 0% :)
11. Reading Assignments
Nine reading assignments.
You should write a review for each paper (at most two pages).
For each paper you should:
• Identify and motivate the problem.
• Pinpoint the main contributions.
• Identify positive/negative aspects of the solution/paper.
For each paper, you might be given some questions to answer.
Students will work in groups of two.
12. Presentation
Each group gives a 30-minute talk on a scientific paper.
The list of papers will be available on the course web page.
You are also free to choose any other paper, but it should be confirmed.
13. Lab Assignments and the Project
Implement simple applications on different frameworks throughout the
course.
The solution to each lab assignment will be uploaded to the course
page one week after its start date.
The final project will be built on top of Spark.
14. Discussion Forum
Use the course discussion forum if you have any questions:
http://www.sics.se/~amir/cloud14.htm
16. Data is not information, information is not knowledge, knowledge is
not understanding, understanding is not wisdom.
- Clifford Stoll
17. Big Data
[Figure: small data vs. big data]
18. Big Data refers to datasets and data flows so large
that they have outpaced our capability to
store, process, analyze, and understand them.
19. The Four Dimensions of Big Data
Volume: data size
Velocity: data generation rate
Variety: data heterogeneity
The 4th V is for Vacillation:
Veracity, Variability, or Value
20. Where Does Big Data Come From?
21. Big Data Market Driving Factors
The number of web pages indexed by Google, which was around
one million in 1998, exceeded one trillion in 2008, and its
growth is being accelerated by the rise of social networks.∗
[∗Wei Fan et al., Mining big data: current status, and forecast to the future, 2013]
22. Big Data Market Driving Factors
Mobile data traffic is expected to grow to 10.8
exabytes per month by 2016.∗
[∗Dan Vesset et al., Worldwide Big Data Technology and Services 2012-2015 Forecast, 2013]
23. Big Data Market Driving Factors
More than 65 billion devices were connected to the Internet by
2010, and this number is expected to reach 230 billion by 2020.∗
[∗John Mahoney et al., The Internet of Things Is Coming, 2013]
24. Big Data Market Driving Factors
Open source communities
25. Big Data Market Driving Factors
Many companies are moving towards using Cloud services to
access Big Data analytical tools.
26. History of Data
27. 4000 B.C.
Manual recording
From tablets to papyrus, to parchment, and then to paper
29. 1800’s - 1940’s
Punched cards (no fault-tolerance)
Binary data
1890: US census
1911: IBM founded
30. 1940’s - 1950’s
Magnetic tapes
31. 1950’s - 1960’s
Large-scale mainframe computers
Batch transaction processing
File-oriented record processing model (e.g., COBOL)
32. 1960’s - 1970’s
Hierarchical DBMS (one-to-many)
Network DBMS (many-to-many)
VM OS by IBM → multiple VMs on a single physical node.
33. 1970’s - 1980’s
Relational DBMS (tables) and SQL
ACID
Client-server computing
Parallel processing
34. 1990’s - 2000’s
Virtual Private Network (VPN) connections
The Internet...
35. 2000’s - Now
Cloud computing
NoSQL: BASE instead of ACID
Big Data
36. Cloud and Big Data
38. Big Data Analytics Stack
39. Hadoop Big Data Analytics Stack
40. Spark Big Data Analytics Stack
44. Big Data - File systems
Traditional file systems are not well suited to large-scale data
processing systems.
Efficiency has a higher priority than other features, e.g., directory
service.
Because the data is massive, it tends to be stored across multiple
machines in a distributed way.
HDFS/GFS, Amazon S3, ...
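The idea of storing data across multiple machines can be sketched as follows. This is only an illustration, assuming fixed-size blocks and a round-robin replica placement; the function names, machine names, and placement rule are hypothetical and are not the policy used by HDFS, GFS, or S3.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, machines: list[str],
                   replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct machines, round-robin.

    Assumption: a deliberately naive placement rule, chosen for clarity.
    """
    if replication > len(machines):
        raise ValueError("replication factor exceeds number of machines")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            machines[(block_id + r) % len(machines)] for r in range(replication)
        ]
    return placement


# Hypothetical 10-byte "file" stored on three machines with two replicas each.
data = b"x" * 10
blocks = split_into_blocks(data, block_size=4)
placement = place_replicas(len(blocks), ["m1", "m2", "m3"], replication=2)
```

Keeping more than one replica per block means a single machine failure does not make any block unavailable in this sketch.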
48. Big Data - Database
Relational Database Management Systems (RDBMS) were not designed
to be distributed.
NoSQL databases relax one or more of the ACID properties: BASE
Different data models: key/value, column-family, graph, document.
HBase/Bigtable, Dynamo, Scalaris, Cassandra, MongoDB, Voldemort,
Riak, Neo4j, ...
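To make the four data models concrete, here is a rough sketch of how the same user record might look in each. Plain Python dictionaries stand in for the stores; the keys, field names, and values are invented for illustration and do not reflect the API of any system listed above.

```python
import json

# Key/value: the store sees only an opaque value behind a single key.
key_value = {"user:42": json.dumps({"name": "Ada", "city": "Tehran"})}

# Document: the value is a nested record whose fields can be addressed.
document = {
    "user:42": {"name": "Ada", "address": {"city": "Tehran"}, "tags": ["admin"]}
}

# Column-family: row key -> column family -> column -> value.
column_family = {
    "user:42": {"profile": {"name": "Ada"}, "location": {"city": "Tehran"}}
}

# Graph: vertices with properties, plus labeled edges between them.
graph = {
    "vertices": {"u42": {"name": "Ada"}, "u7": {"name": "Bob"}},
    "edges": [("u42", "follows", "u7")],
}

# Reading the city in each model shows how much structure the store exposes.
city_from_kv = json.loads(key_value["user:42"])["city"]   # client must decode
city_from_doc = document["user:42"]["address"]["city"]    # path into document
city_from_cf = column_family["user:42"]["location"]["city"]
followed = [dst for src, label, dst in graph["edges"]
            if src == "u42" and label == "follows"]
```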
52. Big Data - Resource Management
Different frameworks require different computing resources.
Large organizations need the ability to share data and resources
between multiple frameworks.
A resource manager shares the resources of a cluster between multiple
frameworks while providing resource isolation.
Mesos, YARN, Quincy, ...
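One simple way to share a single resource between frameworks is max-min fairness, sketched below. This is an assumption made for illustration: it is not the scheduling policy of Mesos, YARN, or Quincy, it handles only one resource type, and it does not model isolation. The framework names and numbers are hypothetical.

```python
def max_min_fair_share(capacity: float,
                       demands: dict[str, float]) -> dict[str, float]:
    """Split `capacity` so that no framework gets more than it asked for
    and unused share is redistributed to frameworks that want more."""
    allocation = {}
    remaining = capacity
    # Serve the smallest demands first; leftovers flow to larger ones.
    pending = sorted(demands.items(), key=lambda item: item[1])
    for index, (name, demand) in enumerate(pending):
        equal_share = remaining / (len(pending) - index)
        granted = min(demand, equal_share)
        allocation[name] = granted
        remaining -= granted
    return allocation


# Hypothetical cluster with 10 CPU slots and three frameworks.
shares = max_min_fair_share(10, {"batch": 2, "stream": 4, "graph": 8})
```

Here the two smaller demands are met in full, and the framework asking for 8 slots receives the 4 that are left.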
55. Big Data - Execution Engine
Scalable and fault-tolerant parallel data processing on clusters of
unreliable machines.
Data-parallel programming model for clusters of commodity machines.
MapReduce, Spark, Stratosphere, Dryad, Hyracks, ...
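The data-parallel model can be illustrated with a word count written in the map/shuffle/reduce style. This is a single-process sketch only: the function names are made up for the example, and the distribution and fault tolerance that the real engines provide are not modeled.

```python
from collections import defaultdict


def map_phase(document: str):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all values by key, as the framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


counts = word_count(["big data", "big clusters"])
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, an engine is free to run them on different machines and to rerun any that fail.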
59. Big Data - Query/Scripting Language
Low-level programming of execution engines, e.g., MapReduce, is
not easy for end users.
A high-level language is needed to improve the query capabilities of
execution engines.
It translates user-defined functions into the low-level API of the
execution engines.
Pig, Hive, Shark, Meteor, DryadLINQ, SCOPE, ...
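The translation step can be sketched as a tiny "compiler" that turns one high-level request (count records grouped by a field) into a mapper and a reducer. Everything here is hypothetical: `compile_count_by` and `run` are illustrative names, and real languages such as Pig or Hive generate far more elaborate plans.

```python
from collections import defaultdict


def compile_count_by(field):
    """Translate 'count records grouped by `field`' into (mapper, reducer)."""
    def mapper(record):
        yield record[field], 1

    def reducer(key, values):
        return key, sum(values)

    return mapper, reducer


def run(mapper, reducer, records):
    """Stand-in for an execution engine: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


# The user states *what* to compute; the low-level functions are generated.
visits = [
    {"page": "/home", "user": "a"},
    {"page": "/docs", "user": "b"},
    {"page": "/home", "user": "c"},
]
mapper, reducer = compile_count_by("page")
result = run(mapper, reducer, visits)
```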
62. Big Data - Stream Processing
Provides users with fresh, low-latency results.
Database Management Systems (DBMS) vs. Data Stream Management
Systems (DSMS)
Storm, S4, SEEP, D-Stream, Naiad, ...
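A DSMS evaluates a standing query as data arrives instead of querying stored data afterwards. The sketch below shows this with a tumbling-window count over a stream of events. It assumes events arrive in timestamp order and uses a hypothetical `(timestamp, key)` event format; it is not the API of any system listed above.

```python
from typing import Iterable, Iterator


def tumbling_window_counts(
    events: Iterable[tuple[int, str]], window_size: int
) -> Iterator[tuple[int, dict[str, int]]]:
    """Yield (window_start, counts) each time a window closes.

    Assumption: events are (timestamp, key) pairs in timestamp order.
    Windows that receive no events are skipped.
    """
    current_start = None
    counts: dict[str, int] = {}
    for timestamp, key in events:
        window_start = (timestamp // window_size) * window_size
        if current_start is None:
            current_start = window_start
        if window_start != current_start:
            # The previous window is complete: emit it immediately.
            yield current_start, counts
            current_start, counts = window_start, {}
        counts[key] = counts.get(key, 0) + 1
    if current_start is not None:
        # Flush the last window; only reached if the stream ends.
        yield current_start, counts


events = [(0, "click"), (3, "view"), (4, "click"), (11, "click"), (25, "view")]
results = list(tumbling_window_counts(events, window_size=10))
```

Each result is produced as soon as its window closes, which is what keeps the output fresh.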
66. Big Data - Graph Processing
Many problems are expressed using graphs: sparse computational
dependencies, and multiple iterations to converge.
Data-parallel frameworks, such as MapReduce, are not ideal for
these problems: they are slow.
Graph processing frameworks are optimized for graph-based problems.
Pregel, Giraph, GraphX, GraphLab, PowerGraph, GraphChi, ...
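The iterative, neighbor-to-neighbor style of graph computation can be sketched with connected components computed by label propagation in synchronous rounds (supersteps). This is a single-process illustration in the spirit of vertex-centric frameworks; the function and its structure are assumptions for the example, not the API of Pregel or any other listed system.

```python
def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex id in its component."""
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    label = {v: v for v in vertices}
    active = set(vertices)
    supersteps = 0
    while active:
        # Phase 1: every active vertex sends its label to its neighbors.
        inbox = {}
        for v in active:
            for n in neighbors[v]:
                inbox.setdefault(n, []).append(label[v])
        # Phase 2: vertices adopt a smaller label and stay active if changed.
        active = set()
        for v, messages in inbox.items():
            smallest = min(messages)
            if smallest < label[v]:
                label[v] = smallest
                active.add(v)
        supersteps += 1
    return label, supersteps


labels, rounds = connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
```

Only vertices whose label changed send messages in the next round, so work shrinks as the computation converges.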
69. Big Data - Machine Learning
Implementing and consuming machine learning techniques at scale
are difficult tasks for developers and end users.
Some platforms address this by providing scalable machine learning
and data mining libraries.
Mahout, MLBase, SystemML, Ricardo, Presto, ...
72. Big Data - Configuration and Synchronization Service
A means of synchronizing distributed applications' access to shared
resources.
Allows distributed processes to coordinate with each other.
ZooKeeper, Chubby, ...
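The role of such a service can be sketched with a toy lock table. This is an in-memory stand-in running in a single process: `LockService` and its methods are hypothetical and are not the ZooKeeper or Chubby API, and the replication and client-failure handling that real services provide are not modeled.

```python
class LockService:
    """Toy coordination service: at most one owner per named resource."""

    def __init__(self):
        self._owners = {}  # resource name -> id of the client holding it

    def acquire(self, resource: str, client: str) -> bool:
        """Grant the lock if it is free; return True if `client` holds it."""
        owner = self._owners.setdefault(resource, client)
        return owner == client

    def release(self, resource: str, client: str) -> bool:
        """Release the lock, but only when called by its current owner."""
        if self._owners.get(resource) != client:
            return False
        del self._owners[resource]
        return True


# Two hypothetical workers compete for the same resource.
service = LockService()
first = service.acquire("/jobs/leader", "worker-1")
second = service.acquire("/jobs/leader", "worker-2")
bad_release = service.release("/jobs/leader", "worker-2")
good_release = service.release("/jobs/leader", "worker-1")
retry = service.acquire("/jobs/leader", "worker-2")
```

The second worker is refused until the first releases the lock, which is the coordination guarantee the processes rely on.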