Under the hood of 3TU.Datacentrum, a repository for research data.
abstract




Egbert Gramsbergen
TU Delft Library /
3TU.Datacentrum
e.f.gramsbergen@tudelft.nl




ELAG, 2012-05-17
3TU.Datacentrum
• 3 Dutch TUs (universities of technology): Delft, Eindhoven, Twente
• Project 2008-2011, going concern 2012-
• Data archive
   –   2008-
   –   “finished” data
   –   preserve but do not forget usability
   –   metadata harvestable (OAI-PMH)
   –   metadata crawlable (OAI-ORE linked data)
   –   data citable (by DataCite DOIs)
• Data labs
   – Just starting
   – Unfinished data + software/scripts
Technology

• Fedora
     Repository software


• THREDDS / OPeNDAP
     Repository software




?
Image: http://commons.wikimedia.org/wiki/File:Engine_of_Trabant_601_S_of_Trabi_Safari_in_Dresden.jpg
Fedora digital objects

     XML container with “datastreams” containing /
     pointing to (meta)data

     •3 special RDF datastreams
     indexed in triple store
     -> query with REST API / SPARQL




     •Any number of content datastreams



     xml datastreams may be inline,
     other datastreams are on a location managed by Fedora
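
The triple store query mentioned on this slide can be sketched as a plain HTTP GET. This is a minimal sketch, assuming a Fedora 3 style resource index endpoint at /risearch; the base URL, the PID and the query itself are placeholders and are not taken from the slides.

```python
from urllib.parse import urlencode

# Hypothetical base URL; replace with the address of a real Fedora server.
FEDORA_BASE = "http://localhost:8080/fedora"

def risearch_url(sparql, base=FEDORA_BASE):
    """Build a GET URL that sends a SPARQL tuple query to the resource index."""
    params = {
        "type": "tuples",    # ask for variable bindings rather than triples
        "lang": "sparql",    # query language; ITQL is the other option named in the slides
        "format": "Sparql",  # SPARQL XML results, convenient for further XSLT processing
        "query": sparql,
    }
    return base + "/risearch?" + urlencode(params)

# Example: list everything stated about one object (the PID is made up).
query = "SELECT ?p ?o WHERE { <info:fedora/demo:1> ?p ?o }"
url = risearch_url(query)
print(url)
```

Only the URL is built here; fetching it and parsing the XML result is left out so that the sketch does not depend on a running server.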
Fedora Content Model Architecture
Content Model object: links to Service Definition(s)
optionally defines datastreams + mime-types
Service Definition object: defines operations (methods) on data objects
incl parameters + validity constraints
Service Deployment object: implements the methods
Requests are handled by some service whose location is known to the Service Deployment




URL: /objects/<data object pid>/methods/<service definition pid>/<method name>[?<params>]
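
The URL pattern above can be filled in programmatically. The following is a small sketch; the base URL, the PIDs and the method name are invented for illustration and do not refer to real 3TU.DC objects.

```python
from urllib.parse import quote, urlencode

def dissemination_url(base, object_pid, sdef_pid, method, params=None):
    """Fill in /objects/<pid>/methods/<sdef pid>/<method>[?<params>]."""
    # PIDs contain a colon (namespace:id); keep it readable in the path.
    path = "/objects/{}/methods/{}/{}".format(
        quote(object_pid, safe=":"), quote(sdef_pid, safe=":"), quote(method)
    )
    query = "?" + urlencode(params) if params else ""
    return base.rstrip("/") + path + query

# Hypothetical server, object, service definition and method.
url = dissemination_url(
    "http://localhost:8080/fedora", "demo:dataset1", "demo:sdef-view", "html", {"lang": "en"}
)
print(url)
```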
Fedora API & Saxon xslt2 service
APIs for viewing and manipulating objects
View API (REST, GET method)
     –   findObjects
     –   getDissemination
     –   getObjectHistory
     –   listDatastreams
     –   risearch (query triple store (ITQL, SPARQL))
     –   …

So everything has a url and returns xml
All methods so far have to return xml or (x)html
xslt is a natural fit
(remember: you can easily open secondary documents aka use the REST API)
xslt2.0 is much more powerful than xslt1.0
With Saxon, you can use Java classes/methods from within xslt
(rarely needed, in 3TU.DC only for spherical trigonometry in geographical calculations)
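
The slides do not say which geographical calculations are meant. As one common example of spherical trigonometry, here is a sketch of the great-circle (haversine) distance, written in Python rather than as a Java extension called from XSLT. The Earth radius value is an assumption of this sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# A quarter of the equator, as a sanity check on the formula.
quarter = great_circle_km(0.0, 0.0, 0.0, 90.0)
print(quarter)
```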
3TU.DC architecture




               Saxon for:
               •html pages
               •rdf for linked data (OAI-ORE)
               •KML for maps
               •Faceted search forms
               •csv, cdl, Excel for datasets
               •xml for indexing by SOLR
               •xml for Datacite
               •xml for PROAI
               •… and more

               Not in picture:
               •PROAI (OAI-PMH service provider)
               •DOI registration (Datacite)
3TU.DC architecture [2]
Content Model Architecture and xslt’s in detail
•10 content models
•7 service definition objects with 19 methods
•14 service deployment objects using 32 xslt’s




 Left to right: content models, service deployments, methods aka xslt’s, service definitions
 Lines: CMA, xslt imports, xml includes. All xslt’s are datastreams of one special xslt object.
rdf relations in 3TU.DC




Example relations (namespaces are omitted for brevity)
UI as rdf / linked data viewer


    This dataset has some metadata
    and is part of this dataset, with these metadata.
    It was calculated from this dataset, with these metadata,
    measured by this instrument, with these metadata.
UI as rdf / linked data viewer [2]

Dilemmas - how far will you go?

•Which relations must be expanded?
•How many levels deep?
•Which inverse relations will you show?
•Show repetitions?



Answer: trial and error

Set of rules for each type of relation

Show enough for context but not too much… it’s a delicate balance
Reminder

           What about this part?
NetCDF

NetCDF: data format + data model

•Developed by UCAR (University Corporation for Atmospheric Research, USA), roots at NASA, 1987.
•Comes with a set of software tools / interfaces for programming languages.
•Binary format, but data can be dumped as ASCII or XML
•Used mainly in geosciences (e.g. climate forecast models)
•BUT: fit for almost any type of numeric data + metadata
•Core data type: multidimensional array


>90% of 3TU.DC data is in NetCDF
NetCDF [2]
Example: T(x,y,z,t) - what can we say in NetCDF?

Variable T (4D array)
Variables x,y,z,t (1D arrays)
Dimensions x,y,z,t
Attributes: creator=‘me’
Attributes: x.units=‘m’, y.units=‘m’, z.units=‘m’, t.units=‘s’, T.units=‘deg_C’
        T.name=‘Temperature’, T.error=0.1, etc…
You may invent your own attributes or use conventions (e.g. CF4)
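
To make the T(x,y,z,t) example concrete, the sketch below writes the same dimensions, variables and attributes as CDL, the text notation used when a NetCDF file is dumped. It only assembles text in plain Python and does not touch a NetCDF library; the dimension sizes and the data type are invented for illustration.

```python
def to_cdl(name, dims, variables, global_attrs):
    """Render a small NetCDF-style description as CDL text."""
    def fmt(value):
        # Strings are quoted in CDL; numbers are written as-is.
        return '"{}"'.format(value) if isinstance(value, str) else repr(value)

    lines = ["netcdf {} {{".format(name), "dimensions:"]
    for dim, size in dims.items():
        lines.append("    {} = {} ;".format(dim, size))
    lines.append("variables:")
    for var, (dtype, var_dims, attrs) in variables.items():
        lines.append("    {} {}({}) ;".format(dtype, var, ", ".join(var_dims)))
        for key, value in attrs.items():
            lines.append("        {}:{} = {} ;".format(var, key, fmt(value)))
    lines.append("// global attributes:")
    for key, value in global_attrs.items():
        lines.append("        :{} = {} ;".format(key, fmt(value)))
    lines.append("}")
    return "\n".join(lines)

# Dimension sizes and the float type are invented; names and attributes follow the slide.
dims = {"x": 20, "y": 20, "z": 5, "t": 10}
variables = {
    "x": ("float", ["x"], {"units": "m"}),
    "y": ("float", ["y"], {"units": "m"}),
    "z": ("float", ["z"], {"units": "m"}),
    "t": ("float", ["t"], {"units": "s"}),
    "T": ("float", ["x", "y", "z", "t"],
          {"units": "deg_C", "name": "Temperature", "error": 0.1}),
}
cdl = to_cdl("example", dims, variables, {"creator": "me"})
print(cdl)
```

Each coordinate variable shares its name with a dimension, which is how the 1D arrays x, y, z, t describe the axes of the 4D array T.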


newer NetCDF versions:
•More complex / irregular / nested structures
•built-in compression by variable
boost compression with “leastSignificantDigit=n”
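
Why does dropping insignificant digits boost compression? The sketch below illustrates the idea with the standard library only: values are rounded to a power-of-two grid that still preserves the requested decimal digit, which zeroes many trailing mantissa bits and lets a general-purpose compressor do much better. The rounding scheme and the synthetic data are assumptions of this sketch; the slides do not describe how a NetCDF library implements the option.

```python
import math
import struct
import zlib

def quantize(values, least_significant_digit):
    """Round values to a power-of-two grid that keeps the requested decimal digit."""
    bits = math.ceil(math.log2(10 ** least_significant_digit))
    scale = 2.0 ** bits
    return [round(v * scale) / scale for v in values]

def compressed_size(values):
    """Size in bytes of the zlib-compressed float64 representation."""
    raw = struct.pack("<{}d".format(len(values)), *values)
    return len(zlib.compress(raw))

# Synthetic "temperature" series, invented for this illustration.
temps = [20.0 + 5.0 * math.sin(i / 50.0) for i in range(5000)]
rounded = quantize(temps, 1)  # keep roughly one decimal digit

size_full = compressed_size(temps)
size_rounded = compressed_size(rounded)
max_error = max(abs(a - b) for a, b in zip(temps, rounded))
print(size_full, size_rounded, max_error)
```

With one significant decimal digit the grid spacing is 1/16, so the rounding error stays below the measurement error of 0.1 used in the example attributes above.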
OPeNDAP

OPeNDAP: protocol to talk to NetCDF (and similar) data over internet
THREDDS: server that speaks OPeNDAP

•Internal metadata directly visible on site
•APIs for all main programming languages
•Queries to obtain:
     – cross-sections (slices, blocks)
     – samples (take only 1 in n points)
     – aggregated datasets (e.g. glue together consecutive time series)

       Queries are handled server-side
       (Datafiles in 3TU.DC are up to 100GB)
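
Slices and samples are requested by appending a constraint expression to the dataset URL, so the server only sends the selected part. The sketch below builds such URLs using the DAP2 hyperslab notation [start:stride:stop], where the stop index is inclusive. The dataset URL is a placeholder; the variable name follows the Python example later in these slides.

```python
def hyperslab(variable, *ranges):
    """Build a DAP2 constraint such as temperature[0:1:1999][0:1:149].

    Each range is (start, stride, stop) with an inclusive stop index.
    """
    return variable + "".join("[{}:{}:{}]".format(a, s, b) for a, s, b in ranges)

def subset_url(dataset_url, constraint, suffix="ascii"):
    """Attach a response suffix and a constraint expression to a dataset URL."""
    return "{}.{}?{}".format(dataset_url, suffix, constraint)

# Hypothetical dataset URL.
base = "http://example.org/thredds/dodsC/some/dataset.nc"
block = hyperslab("temperature", (0, 1, 1999), (0, 1, 149))    # a block of the array
sample = hyperslab("temperature", (0, 10, 1999), (0, 1, 149))  # only 1 in 10 time steps
url = subset_url(base, block)
print(url)
print(subset_url(base, sample))
```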
OPeNDAP python example
# Python 3; requires the pydap and matplotlib packages
import matplotlib.pyplot as plt
from pydap.client import open_url

year = '2008'
month = '08'
myurl = ('http://opendap.tudelft.nl/thredds/dodsC/data2/darelux/maisbich/Tcalibrated/'
         + year + '/' + month + '/Tcalibrated' + year + '_' + month + '.nc')
dataset = open_url(myurl)    # make connection
print(list(dataset.keys()))  # inspect dataset
T = dataset['temperature']   # choose a variable
print(T.shape)               # inspect the dimensions of this variable
T_red = T[:2000,:150]        # take only a part
T_temp = T_red.array
T_time = T_red.time
T_dist = T_red.distance
mesh = plt.pcolormesh(T_dist[:],T_time[:],T_temp[:]) # let's make a nice plot
mesh.axes.set_title('water temperature Maisbich [deg C]')
mesh.axes.set_xlabel('distance [m]')
mesh.axes.set_ylabel('time [days since '+year+'-'+month+'-01T00:00:00]')
mesh.figure.colorbar(mesh)
mesh.figure.savefig('maisbich-'+year+'-'+month+'.png')
mesh.figure.clf()
OPeNDAP catalogs

Datasets are organized in catalogs (catalog.xml)
•Usually (not necessarily) maps to folder
•Contains location, size, date, available services of datasets

Catalogs are our hook to Fedora
catalog.xml -> Fedora object
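
Reading the location, size and date of datasets from a catalog could look like the sketch below. The embedded catalog is a hand-written sample in the style of a THREDDS catalog.xml with invented names and values, and the code is not the ingest procedure used by 3TU.DC, which the slides do not show.

```python
import xml.etree.ElementTree as ET

THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"

# Hand-written sample; all names and values are invented.
SAMPLE = """<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" name="demo">
  <service name="odap" serviceType="OPENDAP" base="/thredds/dodsC/"/>
  <dataset name="2008" ID="demo/2008">
    <dataset name="a.nc" ID="demo/2008/a.nc" urlPath="demo/2008/a.nc">
      <dataSize units="Mbytes">12.5</dataSize>
      <date type="modified">2012-01-01T00:00:00Z</date>
    </dataset>
  </dataset>
</catalog>"""

def list_datasets(catalog_xml):
    """Return (urlPath, size, units, date) for every dataset that has a urlPath."""
    ns = {"t": THREDDS_NS}
    root = ET.fromstring(catalog_xml)
    rows = []
    for ds in root.iter("{%s}dataset" % THREDDS_NS):
        url_path = ds.get("urlPath")
        if url_path is None:
            continue  # container datasets (folders) carry no data of their own
        size = ds.find("t:dataSize", ns)
        date = ds.find("t:date", ns)
        rows.append((
            url_path,
            float(size.text) if size is not None else None,
            size.get("units") if size is not None else None,
            date.text if date is not None else None,
        ))
    return rows

rows = list_datasets(SAMPLE)
print(rows)
```

A list like this is the kind of information a script would need in order to create or update one repository object per catalog.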
OPeNDAP – Fedora integration
Typical bulk ingest

For predictable data structures (e.g. a 2TB disk with data delivered every 3
months, structured in a well-agreed manner):
Bulk ingest from datalab [future?]

Less predictable data structures (e.g. datalab which lifts barrier after
embargo period):
THANK YOU
   QQ?


 data.3tu.nl
