2. what is a Database Management System (DBMS)?
what is a database?
a collection of data
what is a database management system?
... a.k.a. ‘database system’
software to store, access, administer a database
not just a collection of files
provides mechanism to query the data
transfers data between main memory and secondary storage (disk)
enables concurrent access, offers guarantees for data consistency
provides crash recovery mechanisms
provides security and access control
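A minimal sketch of two of the services listed above (a query mechanism and consistency guarantees), using Python's built-in `sqlite3` module as a stand-in DBMS. The `accounts` table, its `CHECK` constraint, and the function name are hypothetical and chosen only for illustration.

```python
import sqlite3


def demo_dbms_services():
    """Run a declarative query and an all-or-nothing update on a toy database."""
    conn = sqlite3.connect(":memory:")  # in-memory database, illustration only
    # Hypothetical table: balances must never become negative.
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()

    # Query mechanism: we state what we want, not how to scan the storage.
    rich = conn.execute(
        "SELECT name FROM accounts WHERE balance >= 100"
    ).fetchall()

    # Consistency: the second update violates the CHECK constraint,
    # so the whole transfer (including the first update) is rolled back.
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE accounts SET balance = balance + 500 WHERE name = 'bob'"
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - 500 WHERE name = 'alice'"
            )
    except sqlite3.IntegrityError:
        pass

    balances = dict(conn.execute("SELECT name, balance FROM accounts"))
    conn.close()
    return rich, balances
```

Calling `demo_dbms_services()` should leave both balances unchanged, because the failed transfer is undone as a unit.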
4. why use a DBMS?
separate logical from physical data organization
efficient data access
guarantee data integrity and security
reduce application development time
data administration
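One way to see the separation of logical from physical data organization: the sketch below, again assuming SQLite via Python's `sqlite3` module, runs the same query before and after an index is added. The table name `events`, the index name `idx_events_product`, and the function name are hypothetical; the rows are loosely based on the online-store example used in these slides.

```python
import sqlite3


def query_before_and_after_index():
    """Run one query twice; only the physical layout changes in between."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (product TEXT, category TEXT, customer TEXT, action TEXT)"
    )
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?, ?)",
        [
            ("Portal 2", "Games", "Michael M.", "Purchase"),
            ("FLWR Plant Food", "Garden", "Aris G.", "View"),
            ("Chase the Rabbit", "Games", "Michael M.", "View"),
            ("Portal 2", "Games", "Orestis K.", "Purchase"),
        ],
    )
    # Logical level: the query text never changes.
    query = (
        "SELECT customer FROM events "
        "WHERE product = ? AND action = 'Purchase' ORDER BY customer"
    )
    args = ("Portal 2",)

    before = conn.execute(query, args).fetchall()
    plan_before = conn.execute("EXPLAIN QUERY PLAN " + query, args).fetchall()

    # Physical level: add an access path; the application code is untouched.
    conn.execute("CREATE INDEX idx_events_product ON events (product)")

    after = conn.execute(query, args).fetchall()
    plan_after = conn.execute("EXPLAIN QUERY PLAN " + query, args).fetchall()
    conn.close()
    return before, after, plan_before, plan_after
```

The two result lists should be identical, while the last column of the plan rows should mention the index only in the second plan.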
7. consider the following task…
data: records that contain information about products viewed or purchased from an online store
task: for each pair of Games products, count the number of customers that have purchased both
Product          | Category | Customer   | Date       | Price | Action   | other...
Portal 2         | Games    | Michael M. | 12/01/2015 | 10€   | Purchase |
...
FLWR Plant Food  | Garden   | Aris G.    | 19/02/2015 | 32€   | View     |
Chase the Rabbit | Games    | Michael M. | 23/04/2015 | 1€    | View     |
Portal 2         | Games    | Orestis K. | 13/05/2015 | 10€   | Purchase |
...
> what challenges does case B pose compared to case A?
hint: limited main memory, disk access, distributed setting
case A: 10,000 records (0.5MB per record, 5GB total disk space), 10GB of main memory
case B: 10,000,000 records (~5TB total disk space) stored across 100 nodes (50GB per node), 10GB of main memory per node
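For a setting like case A, a single-machine, in-memory approach is one plausible starting point. The sketch below assumes each record has been reduced to a `(product, category, customer, action)` tuple; the function name and the last sample row are hypothetical additions for illustration. It is not meant as a solution for case B, where the data on a node exceeds its main memory and one customer's purchases may be spread over several nodes.

```python
from collections import defaultdict
from itertools import combinations


def count_copurchases(records, category="Games"):
    """Count, for each pair of products in `category`, the customers who bought both."""
    # Pass 1: collect the distinct products each customer has purchased.
    bought = defaultdict(set)
    for product, cat, customer, action in records:
        if cat == category and action == "Purchase":
            bought[customer].add(product)

    # Pass 2: every pair in a customer's set contributes one to that pair's count.
    pair_counts = defaultdict(int)
    for products in bought.values():
        for pair in combinations(sorted(products), 2):
            pair_counts[pair] += 1
    return dict(pair_counts)


sample = [
    ("Portal 2", "Games", "Michael M.", "Purchase"),
    ("FLWR Plant Food", "Garden", "Aris G.", "View"),
    ("Chase the Rabbit", "Games", "Michael M.", "View"),
    ("Portal 2", "Games", "Orestis K.", "Purchase"),
    # Hypothetical extra row (not in the slide) so that one pair appears:
    ("Chase the Rabbit", "Games", "Orestis K.", "Purchase"),
]
print(count_copurchases(sample))  # {('Chase the Rabbit', 'Portal 2'): 1}
```

Sorting each customer's products before pairing ensures that a pair is always stored under the same key, whichever product was bought first.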
8. the main message
to manage data efficiently
minimize expensive operations
e.g., disk access
parallelize computation
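A back-of-the-envelope check with the numbers of case A and case B suggests why disk access and parallelism matter. The sketch assumes 1GB = 1000MB, as the slide's own totals imply, and ignores any per-record overhead; the function names are hypothetical.

```python
import math


def data_per_node_gb(num_records, record_mb, nodes):
    """Data volume each node holds, in GB (assuming 1GB = 1000MB)."""
    return num_records * record_mb / 1000 / nodes


def memory_sized_chunks(data_gb, memory_gb):
    """Minimum number of memory-sized chunks needed to read the data once."""
    return math.ceil(data_gb / memory_gb)


case_a = data_per_node_gb(10_000, 0.5, nodes=1)        # 5.0 GB on one machine
case_b = data_per_node_gb(10_000_000, 0.5, nodes=100)  # 50.0 GB per node

print(memory_sized_chunks(case_a, memory_gb=10))  # 1: fits in main memory
print(memory_sized_chunks(case_b, memory_gb=10))  # 5: must be read from disk in pieces
```

In case A the data can be loaded once and processed in memory; in case B each node has to stream its share from disk in several pieces, and the nodes have to cooperate to combine their partial results.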
9. why study database systems?
to manage data efficiently ...
... from different roles
• develop database systems that match application requirements
• use database systems efficiently
  ○ … knowing how a DBMS
    ■ stores data,
    ■ processes queries,
    ■ and accesses data
  ○ … allows us to
    ✓ organize data appropriately
    ✓ design efficient algorithms to process the data
• combine existing database systems to match requirements
• large variety of data and applications
• “one size fits none” - Michael Stonebraker
10. the relational database system
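[figure: architecture of a relational database system. The database user, application, and database design reach the DBMS through the query interface. DBMS layers: query optimization & execution; relational operators; files and access methods; buffer (memory) management; disk space management. Below the DBMS: the database (data stored on disk). Annotations on the figure: "introductory course: relational dbms", "our course: relational dbms", "our course: non-relational dbms".]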
12. modern database systems
beyond the typical relational DBMS setting...
different data models
semi-structured data, unstructured text, graphs
operations at massive scale
big data platforms & MapReduce paradigm,
Hadoop and Spark, cloud computing
tailored performance
key-value stores, column-stores,
in-memory databases, streaming systems
13. about this course
familiarize ourselves with modern database systems
principles and practice
database models: data, queries, and computation
algorithms: from simple queries (e.g., joins) to complex algorithms
experience with real technologies
emphasis is on understanding of core issues…
essentially: the cost of algorithms for different database models and settings
you can use what you learn here to:
select a database system that fits the demands of your application...
… based on supported data model, functionality, optimizations, scalability
design your database to fit the needs of your application
… e.g., by building appropriate index structures
write fast algorithms to process your data…
… and estimate their running time
adapt your knowledge to the database system you’ll be using 5 years from now
14. syllabus
part 1: relational database systems (Jan 22 & Jan 29 - Michael)
topics: relational model (SQL), indexing (B+ trees, hash tables), join algorithms, query optimization
technology: MySQL
part 2: semi-structured data (Feb 5 - Aris)
topics: semi-structured data abstraction, representation, search, indexing and pipeline aggregation
technology: MongoDB
part 3: unstructured text and information retrieval (Feb 12 & Feb 26 - Aris)
topics: querying text data, inverted indexes, compression, ranking and evaluation, rank aggregation
technology: Lucene
part 4a: big data platforms - MapReduce (Mar 4 & Mar 11 - Michael)
topics: MapReduce paradigm, algorithms in MapReduce
technology: Hadoop
part 4b: big data platforms - graph databases (Mar 18 & Apr 1 - Aris)
topics: the Pregel paradigm, algorithms on Pregel (PageRank, centrality)
technology: Hadoop Giraph, Spark GraphX
15. logistics
instructors: Aristides Gionis, Michael Mathioudakis
teaching assistant: Orestis Kostakis
lectures: Friday 10-12, Room T3
weeks 2 - 6, 8 - 11, 13
office hours: by appointment, starting on Monday January 25th
curriculum
• slides and course notes
• no single textbook, but slides will provide references for further study
announcements
• follow course website on mycourses.aalto.fi
when you send email
• aristides.gionis / michael.mathioudakis / orestis.kostakis @ aalto.fi
• subject: [ModernDB] your topic
programming assignments
• we’ll provide instances of VirtualBox, ready for use
• you can use campus labs or own laptop
• access to CSC, if needed
16. workload & grading scheme
● 3 assignments + exam
● 25% each
● need to have at least 50% on each
○ i.e., cannot skip some of the course
● assignments:
○ pen & paper
■ based on slides + references
○ programming
■ real-world tools, e.g., MySQL, MongoDB, Spark
■ will provide tutorials
17. that’s all for now!
questions?
next week
relational model and SQL
indexing
access cost analysis
what to do until then
(if you want)
SQL
18. Credits
for some of these slides, we used material from
“Database Systems: The Complete Book”, by Garcia-Molina, Ullman, Widom
“Database Management Systems”, by Ramakrishnan and Gehrke