Lecture: Introduction to Data Science
given 2017 at Technical University of Kaiserslautern, Germany
Lecturer: Frank Kienle, Head of AI and Data Science, Camelot ITLab
Topic: introduction to databases
3. Overview of data sources
• http://www.kdnuggets.com/datasets/index.html
Machine learning data
• UCI Machine Learning Repository: archive.ics.uci.edu
DataShop: the world's largest repository of learning interaction data
• https://pslcdatashop.web.cmu.edu
Getting data is not the problem
- there is a very wide variety of data sources
06.09.17 Frank Kienle 3
4. Database
• Formally, a "database" refers to a set of related data and the way it is organized.
• A database manages data efficiently and allows users to perform multiple tasks with ease. Efficient access to the data is usually provided by a "database management system" (DBMS).
• A database management system stores, organizes and manages a large amount of information within a single software application.
• Use of such a system increases the efficiency of business operations and reduces overall costs.
• Different database systems exist, designed with respect to:
• the data to be stored in the database
• the relationships between the different data elements; dependencies within the data can be modeled by mathematical relations
• the logical structure imposed on the data on the basis of these relationships. The goal is to arrange the data into a logical structure which can then be mapped into storage objects.
6. Scalability in big data
Scale up: using more and more main memory
Scale out: using more and more computers
Definition (complexity order m):
For N data items, an algorithm of complexity order m scales with N^m, e.g. polynomial complexity.
Parallelized over k nodes, the algorithm scales with N^m / k.
Goal: find algorithms with complexity N log(N), which relates e.g. to trees (one touch per item).
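The point about parallelism versus complexity order can be sketched with plain operation counts (a minimal illustration, not from the lecture):

```python
import math

def ops_polynomial(n, m):
    """Operation count for an algorithm that scales with N^m."""
    return n ** m

def ops_parallel(n, m, k):
    """Same algorithm spread over k nodes: N^m / k."""
    return n ** m / k

def ops_nlogn(n):
    """Target complexity N log N, e.g. tree-based (one-touch) algorithms."""
    return n * math.log(n)

# For N = 10^6 items a quadratic algorithm needs 10^12 operations; even
# 1000 nodes only bring that down to 10^9, while N log N needs roughly
# 1.4 * 10^7 -- parallelism does not substitute for a better algorithm.
```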
7. CAP theorem
C: consistency
(do all applications see the same data?)
Any data written to the database must be valid according to all defined rules.
A: availability
(can I interact with the system in the presence of failures?)
P: partition tolerance
If two sections of your system cannot talk to each other, can they make forward progress on their own?
- If not, you sacrifice availability.
- If so, you might have to sacrifice consistency.
[Diagram: CAP triangle with example systems — Dynamo, Riak, Voldemort, Cassandra, CouchDB; Bigtable, HBase, Hypertable, Megastore, Spanner, Accumulo; RDBMS]
9. Relational Databases
Key idea:
§ storage and retrieval of large quantities of related data.
§ When creating a database, you should think about which tables are needed and what relationships exist between the data in your tables.
§ Relational algebra
§ Physical/logical data independence
Think about the design in advance.
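A minimal sketch of the "tables and relationships" design step, using Python's built-in sqlite3; the table and column names are illustrative, not from the lecture:

```python
import sqlite3

# Throwaway in-memory database for the sketch.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE rooms (
        room_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    )
""")
con.execute("""
    CREATE TABLE measurements (
        ts      INTEGER,
        room_id INTEGER REFERENCES rooms(room_id),  -- relationship between the tables
        value   REAL
    )
""")
```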
10. Structured Query Language (SQL)
A database is created for the storage and retrieval of data: we want to be able to INSERT data into the database, and we want to be able to SELECT data from the database. A database query language called the Structured Query Language (SQL) was invented for these tasks.
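The INSERT/SELECT pair above can be tried directly with Python's built-in sqlite3 (table name and data are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway in-memory database
con.execute("CREATE TABLE readings (ts INTEGER, value REAL)")

# INSERT data into the database ...
con.executemany("INSERT INTO readings VALUES (?, ?)",
                [(1, 30.0), (2, 25.0), (5, 12.0)])

# ... and SELECT data from it again.
rows = con.execute(
    "SELECT ts, value FROM readings WHERE value > 20 ORDER BY ts"
).fetchall()
```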
11. Fundamentals of data exploration (joins)
When a database can do JOINs, that is good for analytics. When a database does not provide joins, all the work is left to the users (the work stays on the client side).
12. Outer Relational Join (on time stamp)

Time stamp [s] | Value room [Wa2]
1              | 30
2              | 25
5              | 12

Time stamp [s] | Value Home [Wa2]
1              | 100
2              | 78
3              | 99
4              | 70

Result:
Time stamp [s] | Value Room [Wa2] | Value Home [Wa2]
1              | 30               | 100
2              | 25               | 78
3              | NaN              | 99
4              | NaN              | 70
5              | 12               | NaN
13. Left Join (on time stamp)

Time stamp [s] | Value room [Wa2]
1              | 30
2              | 25
5              | 12

Time stamp [s] | Value Home [Wa2]
1              | 100
2              | 78
3              | 99
4              | 70

Result:
Time stamp [s] | Value Room [Wa2] | Value Home [Wa2]
1              | 30               | 100
2              | 25               | 78
5              | 12               | NaN
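The two join types above can be sketched with plain Python dicts keyed by timestamp (None stands in for NaN here; a minimal illustration, not a database implementation):

```python
room = {1: 30, 2: 25, 5: 12}           # Value room [Wa2]
home = {1: 100, 2: 78, 3: 99, 4: 70}   # Value Home [Wa2]

def outer_join(a, b):
    """Outer join: keep every timestamp that appears in either table."""
    keys = sorted(set(a) | set(b))
    return [(t, a.get(t), b.get(t)) for t in keys]

def left_join(a, b):
    """Left join: keep exactly the timestamps of the left table."""
    return [(t, a[t], b.get(t)) for t in sorted(a)]
```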
14. Storing data efficiently is all about the application:
schema-less vs. schema
write-centric vs. read-centric
transactional vs. analytics
batch vs. stream
15. Different data structures
Key-value object
• a set of key-value pairs
Extensible record (XML or JSON)
• families of attributes have a schema
• new attributes may be added
• many predictive analytics tasks require this kind of record
• many REST APIs deliver JSON (or YAML, XML) structures
• example: Twitter feeds
Key-value stores (a document store might be seen as a subset)
• no schema, no exposed nesting
• often raw data (scalable to petabytes)
• simple analytics tasks on top

Example key-value pairs:
key    | value
45777  | Frank Kienle, Germany
Ux_78  | Please learn
321-87 | Random data
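The two shapes above can be sketched in a few lines of Python (the extensible-record attributes are illustrative, not from the lecture; the key-value pairs are taken from the slide):

```python
import json

# Key-value store: opaque values behind simple get/put, no query language.
store = {}
store["45777"] = "Frank Kienle, Germany"
store["Ux_78"] = "Please learn"
store["321-87"] = "Random data"

# Extensible record (JSON): attributes can be added without a fixed schema.
record = {"user": "example", "lang": "en"}   # illustrative attributes
record["location"] = "Kaiserslautern"        # new attribute added dynamically
serialized = json.dumps(record)              # what a REST API would deliver
```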
18. (Typical) NoSQL database features
The ability to replicate and partition data over many servers
• sharding: horizontal partitioning of the data set
No query language: a simple API is defined
Ability to scale operations over many servers
• throughput increase
• due to the missing query-language layer, each operation has to be designed against the API
Operations often have restrictions regarding data locality
New features can be added dynamically to data records (no fixed schema)
Consistency model is often weak (no modeling of transactions)
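Sharding as horizontal partitioning can be sketched with a hash of the key choosing the server (a toy model with in-process dicts standing in for nodes; everything here is illustrative):

```python
import hashlib

def shard_for(key, num_shards):
    """Hash-based horizontal partitioning: each key maps to one shard.
    md5 keeps the mapping stable across processes (unlike built-in hash())."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

shards = [dict() for _ in range(4)]  # four illustrative server nodes

# The "simple API": put/get, no query language.
def put(key, value):
    shards[shard_for(key, len(shards))][key] = value

def get(key):
    return shards[shard_for(key, len(shards))].get(key)
```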
19. Main memory database system (MMDB)
In-memory database:
• primarily relies on main memory for data storage
• main purpose is faster analytics on the data
• relational or unstructured data structures
• memory-optimized data structures
20. Row vs. column data stores
Advantages of column-oriented stores:
• Reading efficiency: more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns, e.g.
select col_1, col_2 from table where col_2 > 5 and col_2 < 45;
• Writing efficiency: more efficient when new values of a column are supplied for all rows at once.
Advantages of row-oriented stores:
• Reading efficiency: more efficient when many columns of a single row are required at the same time, and when the row size is relatively small.
• Writing efficiency: more efficient when writing a new row if all of the row data is supplied at the same time, as the entire row can be written with a single disk seek.
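The two layouts can be mimicked with Python lists (a minimal sketch with illustrative data; real stores differ in compression, paging, etc.):

```python
# Row store: one list of row tuples (ts, col_1, col_2).
rows = [(1, 30.0, 100.0), (2, 25.0, 78.0), (3, 7.0, 99.0)]

# Column store: one list per column.
columns = {"ts": [1, 2, 3],
           "col_1": [30.0, 25.0, 7.0],
           "col_2": [100.0, 78.0, 99.0]}

# Aggregating one column: the column store touches only that column,
# while the row store must scan every full row.
total_row_store = sum(r[2] for r in rows)
total_col_store = sum(columns["col_2"])

# Fetching one complete row: a single access in the row store,
# but one access per column in the column store.
row_1 = rows[0]
row_1_from_columns = tuple(columns[c][0] for c in ("ts", "col_1", "col_2"))
```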
21. Processing types
OLTP: On-line Transaction Processing
e.g. business transactions (insert, update, delete)
OLAP: On-line Analytical Processing
e.g. complex analytics (aggregation of historical data)
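The contrast can be shown in a few lines of sqlite3 (table and data are illustrative): many small OLTP-style transactions first, then one OLAP-style aggregate over the accumulated history.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (day INTEGER, amount REAL)")

# OLTP-style: many small transactions touching single records.
with con:  # each "with" block commits as one transaction
    con.execute("INSERT INTO sales VALUES (1, 10.0)")
with con:
    con.execute("INSERT INTO sales VALUES (1, 5.0)")
with con:
    con.execute("INSERT INTO sales VALUES (2, 7.0)")

# OLAP-style: one analytical query aggregating the historical data.
per_day = con.execute(
    "SELECT day, SUM(amount) FROM sales GROUP BY day ORDER BY day"
).fetchall()
```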
22. For data analytics, a column-oriented in-memory database is a must-have.
23. Many trends in databases are going back to data consistency
Spanner idea: a planet-scale database system.
"... we believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions ..."
Loose consistency for predictive analytics is horrible.
Loose consistency is a no-go for prescriptive analytics (e.g. dynamic pricing).
Systems should always be designed for usability.