Peter Boncz (http://homepages.cwi.nl/~boncz/) describes the challenges that data poses for data management systems. He describes his links to other computer science disciplines within the DSRC and, importantly, outlines the need to train data scientists.
2. Database Research
Data Mgmt Systems Research
• SIGMOD, TODS, PVLDB, ICDE, VLDBJ
– major industry connections (billions of $ per year)
Expanding Topic set & Societal Impact
– Data Stream Processing
– Data Mining
– Information Extraction, Text Retrieval
– RDF and Graph data management
– MapReduce + Cloud
– Data Privacy
4. DB Research Highlights (1/4)
Data Storage and Query – efficiency/scalability
• Computer architecture vs DBMS architecture
– Columnar storage
– Fast Compression Methods
– Differential Storage Techniques (Positional Delta Trees)
– Vectorized Execution
– Robust Query Execution (“micro adaptivity”)
– Just-In-Time (JIT) Compilation
– Cooperative Scans: sharing scarce I/O bandwidth
• http://www.tpc.org/tpch/results/tpch_perf_results.asp?resulttype=noncluster
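As a rough illustration of two items on this slide, columnar storage and vectorized execution, the sketch below keeps one array per attribute and processes a filter-plus-sum query one vector of values at a time. It is a minimal sketch only, not the MonetDB or Vectorwise implementation; the function name, the column names and the vector size are assumptions made for the example.

```python
from array import array

VECTOR_SIZE = 1024  # hypothetical vector length, chosen only for illustration


def vectorized_sum_where(price, quantity, min_quantity, vector_size=VECTOR_SIZE):
    """Sum price[i] over rows where quantity[i] >= min_quantity, vector by vector."""
    total = 0.0
    for start in range(0, len(price), vector_size):
        # Each step touches one slice (vector) of each column it needs.
        q_vec = quantity[start:start + vector_size]
        p_vec = price[start:start + vector_size]
        # Selection primitive: positions of qualifying rows within this vector.
        selected = [i for i, q in enumerate(q_vec) if q >= min_quantity]
        # Aggregation primitive: consumes only the selected positions.
        total += sum(p_vec[i] for i in selected)
    return total


# Columnar layout: one contiguous array per attribute instead of one record per row.
price = array("d", [10.0, 20.0, 30.0, 40.0, 50.0])
quantity = array("i", [1, 5, 2, 7, 5])

result = vectorized_sum_where(price, quantity, min_quantity=5, vector_size=2)
print(result)
```

The query reads only the two columns it needs, and the per-vector loop amortizes interpretation overhead over many values, which is the intuition behind both techniques.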
5. DB Research Highlights (2/4)
Commodity Cluster Computing - Cloud
• Various MonetDB Cluster Projects
– Shared-nothing data storage, query optimization
• Hadoop VectorWise (VU MSc projects)
– cluster scalability & failover
– Tightly integrated Hadoop/YARN/HDFS
• CWI scilens cluster
– Amdahl number > 1: large I/O resources
– Other uses: web crawl analysis, 500-billion-triple BI BSBM benchmark
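The Amdahl number mentioned above is commonly defined as bits of sequential I/O per second divided by instructions per second, so a value above 1 indicates a machine with generous I/O relative to its compute. A minimal sketch of the arithmetic follows; the hardware figures are hypothetical and are not the SciLens specifications.

```python
def amdahl_number(io_bytes_per_second, instructions_per_second):
    """Bits of sequential I/O per second divided by instructions per second."""
    return (io_bytes_per_second * 8) / instructions_per_second


# Hypothetical node: 2 GB/s sequential I/O and 10 billion instructions per second.
io_rich_node = amdahl_number(2e9, 10e9)

# Hypothetical compute-heavy node: 0.2 GB/s sequential I/O, 40 billion instructions per second.
compute_heavy_node = amdahl_number(0.2e9, 40e9)

print(io_rich_node, compute_heavy_node)
```

Under these assumed figures the first node comes out above 1 and the second far below it, which is the contrast the slide draws for data-intensive workloads.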
6. DB Research Highlights (3/4)
Adaptive Indexing
• DBA expertise extremely scarce
• Science workloads hard to predict & variable
Database Cracking:
“every query is advice on how to store the data”
continuous self-steering data reorganization
+ Approximate Query Execution on Samples
+ Recycling – exploit overlap in workloads
+ Fingerprint Indexing – exploit local correlations
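The database cracking idea above can be sketched as follows: each range query partitions the column around its bounds, so the physical order gradually adapts to the workload without an upfront index build. This is a simplified illustration, not the MonetDB implementation; the class name `CrackerColumn` and its methods are hypothetical, and it partitions via list copies where a real system would swap values in place.

```python
import bisect


class CrackerColumn:
    """A column copy that is incrementally reorganized by the queries it answers."""

    def __init__(self, values):
        self.data = list(values)  # copy of the column that gets reordered
        self.pivots = []          # sorted pivot values seen so far
        self.positions = []       # positions[i]: first index holding a value >= pivots[i]

    def _crack(self, pivot):
        """Partition the piece containing pivot; return the first index with value >= pivot."""
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.positions[i]  # already cracked on this value
        lo = self.positions[i - 1] if i > 0 else 0
        hi = self.positions[i] if i < len(self.pivots) else len(self.data)
        piece = self.data[lo:hi]
        smaller = [v for v in piece if v < pivot]
        larger = [v for v in piece if v >= pivot]
        self.data[lo:hi] = smaller + larger  # only this piece is touched
        split = lo + len(smaller)
        self.pivots.insert(i, pivot)
        self.positions.insert(i, split)
        return split

    def range_query(self, low, high):
        """Return all values v with low <= v < high, cracking on both bounds."""
        start = self._crack(low)
        end = self._crack(high)
        return self.data[start:end]


column = CrackerColumn([13, 16, 4, 9, 2, 12, 7, 1, 19, 3])
first = column.range_query(5, 13)   # cracks the whole column twice
second = column.range_query(7, 10)  # only touches the small piece between 5 and 13
print(sorted(first), sorted(second))
```

After the first query the values between 5 and 13 sit in one contiguous piece, so the second query reorganizes only that piece; later queries on the same ranges get cheaper as more pivots accumulate.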
7. DB Research Highlights (4/4)
Support for non-tabular data
• Text (retrieval)
• Scientific
– Data vaults: directly query FITS, GeoTIFF, BEM, MSEED, ...
– SciQL: Arrays as 1st class database objects
– MonetDB.R: using columns as arrays (and vice versa)
• Semantic Data – RDF
– “automatically discovering schemas in LOD data”
• Bridge gap between RDF and relational
• Graph Data Management
– Benchmark development
8. Application Areas
– Business Intelligence
• Marketing/Sales, Fraud Detection, Churn (spin-offs)
• Social network analysis (LDBC)
– Security
• Digital Forensics (NFI - XIRAF)
• ...
– Science
• Astronomy (LOFAR transient search)
• Meteorology (Earthquake Analysis - KNMI)
– Linked Data
• Open government (LOD2)
10. Data Science Education
enormous demand for (“big”) data scientists
• Possibilities/limitations of wide array of techniques
– Information extraction, cleaning
– Ranking, retrieval
– Data Mining, and its applications
– DB principles (Q-opt, query processing algorithms, storage techniques)
• Understand key performance factors
– Latency vs bandwidth
– Networks, computer architecture
– algorithm optimization techniques
• Practical skills
– Modern Software engineering methods
– Rapid prototyping languages
– Solving problems using Hadoop clusters
proposal: “Extreme Data Management” MSc course
11. Opportunities: CWI
• Database Architecture Group
– research, application, data science experience
– MonetDB, Vectorwise technologies
– Scilens: data-intensive large compute cluster
• CWI motivators
– Dual Appointments
– Data Science MSc education
• Attracting top students into MSc projects / PhD
– DSRC co-positioning in future research funding
12. Conclusion
• Database research present in Amsterdam
– research, application, valorisation
• Data Science Education!
– Proposal: “Extreme Data Management” course
• ..DSRC and the CWI..