SlideShare ist ein Scribd-Unternehmen logo
1 von 22
Downloaden Sie, um offline zu lesen
CLIO INFRA
Data analysis in Dataverse & visualization
of datasets on historical maps
Dataverse Community meeting 2015,
June 11, Harvard University
Vyacheslav Tykhonov Richard Zijdeman Jerry de Vries
International Institute of Social History
Introduction
The International Institute of Social History (IISH) collects information in the field of social
history and makes it available to the public. The IISH is one of the major scientific heritage
institutions in the Netherlands. In the field of socio-economic historical research IISH plays a
prominent role.
CLIO INFRA is Digital Research Infrastructure for the Arts and Humanities developed to
integrate a number of databases (hubs) consisting of data on global social, economic and
institutional indicators over the past five centuries, with special attention for the past 200
years. Clio Infra developed by IISH.
Our mission
Clio Infra is the bridge between Data Collections and Research Datasets.
“Old school” researchers time:
Collecting datasets in CSV, Access or Excel and other formats mostly on his own without real collaboration
between all researchers in the same field. Very closed model.
Digital Humanities era:
Young researchers use various computer tools to collect, analyze and combine research datasets they’ve
got from older generation of researchers. They can contribute and help each other and build communities.
The future:
The goal is to build tools to get datasets from “old school” researchers, standardize and store data in the
Clouds in order provide access to Open Data for researchers all over the world and made possible
collaboration between them.
Clio Infra Collaboration platform
Clio Infra functionality based on the Dataverse solution:
- teams collaboratively can curate, share and analyze research datasets
- teams members can share the responsibility to collect data on specific variables (for
example, countries) and inform each other about changes and additions
- dataset version control system is able to track changes in datasets
- other researchers can download their own copy of the data if dataset is published as Open
Data
Dataverse is flexible metadata store (repository) that connected with Research datasets
storage by our Data Processing Engine (DPE)
Added Value for future Researchers
The benefits of data sharing can be classified in terms of Metadata and Data access and
sharing (Collection tools) and Statistical Analysis and Data Mining (Research tools):
● access to a specific case study, citing and finding data
● access to the universe of data from Dataverse network that can organize and display
them for browsing and searching
● data filtering: researchers with proper authorization can obtain the subset of data
provided by data collector
● data analysis to run descriptive statistics and graphics, visualization, plotting on
historical maps
● Data APIs to export data for further analysis by popular statistical packages (STATA,
SPSS, R, iPython Notebook) and advanced data mining tools that will be developed in
the future (always up-to-date solution)
Collaboration possibilities
Descriptive metadata:
- dataset file
- documentation
- standardized tables with codes can be part of dataset
Sharing and collaboration capabilities:
- requesting unique API token by every researcher - user of Dataverse
- exchanging API token between researchers and granting permissions to work on the
same datasets as a team using data analysis tools
Collaboration Data Workflow
- Draft Datasets are visible only for owners
- with API tokens researchers can get access to interactive dashboard to get some
insights about the data stored in the Dataverse
- every dataset converted to dataframe
- dashboard can provide access to all variables from the dataframe and visualize them
on charts, graphs, historical maps and treemaps
After dataset is prepared to go public, it can be published in Collabs:
- guest users can download the copy of dataset
- team members with permissions and authorized API token can contribute to dataset
Data Analysis in Clio Infra
● Data Processing Engine has python core and developed for nlgis.nl (Netherlands
Geographic Information System) and distributed as widget
● interactive data exploration dashboard based on D3 library
● every dataset from Dataverse is available as Data API (json) and can be connected to
any statistical package
● filtering the data on specific years and variables based on dataframes (pandas, python
for data analysis)
● data quality check is the quick visual tool to apply Benford’s law for all values from
specific dataset
● dataframes can be open by researchers in various statistical packages like iPython
notebook, R Studio, SPSS, STATA, Mathlab, etc
Data Processing Engine (DPE) specification
● can split values from any dataset in number of categories specified by researcher (8 by
default)
● algorithm to categorize data values in proper categories can be selected manually
(percentile by default)
● can define maximum possible categories for specific dataset if there is no way to get
categories number specified by user of the system (for example, if there are 2-3
categories of data values)
● data ranges should be defined to get possibility to visualize data on some chart or map
in the right scale
● colors can be specified by user (Color Brewing, see http://colorbrewer2.org)
● legend generated and attached to all visualizations automatically
● values with missing data shown as 'no data' regions on map
● all data values delivered by Data API to make the data analysis platform independent
and communicate with other systems or statistical packages
API Service (Data API)
Data API provided by Data Processing Engine is the most important functionality for the well
equipped digital infrastructure:
● easy way to analyze data in popular statistical packages (STATA, SPSS, Excel)
● use common data science programming languages like Python, R to perform more
advanced research using external Data Science libraries
● analyze data with toolboxes like Wolfram|Alpha and other Discovery Platforms (added
value for the future)
● suitable for other researchers and developers to use advanced technique and data
mining tools that aren’t developed yet
Example of output from Data API
● every dataset ingested by DPE available as Data API with unique handler
● API can be filtered by variables extracted from the content of data file
Example:
/api/data?&handle=F16UDU:30:31&countrycode=USA&year=1880&categories=8&datarange=calculate
"United States of America": {
"code": "F16UDU30_31",
"color": "#FF7F00",
"countrycode": "USA",
"id": "2085",
"indicator": "Total Urban Population",
"intcode": "840",
"r": 923.99,
"region": "W. Offshoots",
"units": "x 1000",
"value": 14264.0,
"year": 1880
}
}
Data visualization and plotting data on historical maps
● Data Processing Engine (DPE) is the core of data visualization process and connected
to geoservice
● data attributes like scales and colors calculated by DPE on the fly based on the input of
researcher (for example, number of categories to split data) and the part of Data API
● histograms, cross tabulations, enhanced descriptive statistics based on pandas
dataframes
● visualization of datasets on historical maps will be available to plot data on maps for
last 500 years
Historical maps services: Geoservice and Geocoder
Delivered by Webmapper.nl and integrated with common Clio Infra infrastructure.
Geoservice basic requirements:
- should provide actual GeoAPI with historical polygons for all countries and regions
based on standardized codes
- available as geojson/topojson for online visualization
- QGIS toolbox should be supported to upload new maps and update old maps as shape
files
Geocoder should be able to standardize all geographical locations in different datasets:
- USSR and Soviet Union should be recognized as one country with the same PID
- Germany before and after 1990
- Indonesia before and after 1999
GeoAPI example
Geoservice can provide polygons for specific years on the national or world level rendered
as topojson or geojson.
GeoAPI:
/api/maps?world=on&year=1962
Polygons for all countries will be delivered as topojson:
arcs":[[1782,2186]]}]}},"arcs":[[[8387,6231],[0,5],[1,1],[1,-1],[2,0],[2,-1],[3,-4],[1,
-3],[0,-1],[-1,-5],[0,-1],[-1,2],[0,1],[-2,2],[-3,1],[-1,0],[-1,0],[-1,3],[0,1]],
[[8390,6247],[1,1],[0,1],[2,1],[1,0],[1,-2],[-1,-5],[-1,0],[-1,1],[-1,1],[-1,1],[0,1]],
[[8391,6204],[0,2],[-1,1],[-1,-1],[0,1],[0,1],[1,3],[1,0],[2,-6],[0,-1],[0,-1],[0,-1],
[-1,1],[-1,1]],[[8364,6093],[0,2],[2,5],[1,0],[1,-2],[-1,-6],[-1,-3],[-1,0],[-1,0],[0,1],
[0,3]],[[5941,6575],[0,-1],[-1,0],[-1,1],[0,1],[-1,0],[-1,0],[-1,0],[-1,0],[0,-1],[-1,-1],
[0,-2],[0,-1],[0,-2],[0,-1],[-1,-3],[-3,-2],[-4,-4],[-1,-1],[-1,0],[-2,-1],[-2,0],[-1,0],[-1,
-1],[-1,-2],[-5,1],[-1,0],[-1,-1],[-1,-1],[-1,0],[-2,1],[-4,3],[-1,1],[-1,2],[-2,8],[-1,4],
[-1,9],[0,1],[0,2],[0,1],[1,0],[0,-1],[0,-1],[1,-1],[0,-1],[1,0]
Integration of Dataverse, DPE, geoservice and geocoder in 6 steps
1. Data API communicates with geocoder to find standardized codes for all locations in
dataset
2. the same geocode with associated polygons is coming from geoservice
3. data and geoservice matching in the frontend by any geospatial javascript library
4. attributes like colors and scales filling the map polygons
5. legend with scales, values and colors coming from DPE to make map look complete
6. notes, sources and other metadata extracting from dataverse API to provide clear
explanation of generated historical map
Live demo for Labor Conflicts in 100 years
Dataset Quality Check
Benford’s law to do test of the quality of data
Linked Edit Rules: a methodology to publish, link, combine and execute edit rules on the
Web as Linked Data to verify consistency of statistical datasets and recognize wrong filled
data values (for example, characters in values where numbers expected)
Chart visualization to get overview of missing data values
Benford’s law in action on real dataset
Datasets combining and aggregation in data exploration tools
● tools that can predict the same variables disambiguated in different datasets (Year and
Jaar)
● geocoding services can standardize different regions (Netherlands → NL, Amsterdam
→ AMS)
● all possible relationship paths can be ranked from the “best” (100%) to the “worst”
(0%), for example, value “05” for variable “Month” can be recognized as “May”
● standardized datasets can be used as “reference” data for other datasets from other
researcher groups and depot services (CHIA, Harvard DataVerse Network, MIT,
DANS)
Clio Infra Infrastructure is suitable for different kinds of datasets
Quantitative datasets that store quantity data values (numbers):
- data measured (length, speed, height, age)
- example: current clio-infra datasets
Qualitative datasets store quality observations (descriptions):
- data can be observed (colors, textures, professions, groups)
- example: HISCO, world strikes dataset
Data Visualization is different for different kinds of datasets:
- quantity can be plotted on charts, graphs, historical maps
- quality (hierarchy) can be visualized on treemaps and maps
Example: Interactive data exploration based on API token
Summary: data exploration and analysis will extend Dataverse functionality
● data quality check during upload/update of dataset
● automatic ingestion and recognition of years and locations in datasets
● integrated with geocoder and geoservice to get polygons for maps
● interactive dashboard to do visual exploration of variables from dataset (graph /
histogram / scatterplot)
● data processing engine to plot data on interactive historical maps (if dataset has
geospatial data)
● correlation for continuous variables
● regression analysis for estimating the relationships among variables
● building treemaps for qualitative data analysis (hierarchical data coming soon)
Thank you!
Any suggestions?

Weitere ähnliche Inhalte

Was ist angesagt?

euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)
Besnik Fetahu
 
Introduction to RDF & SPARQL
Introduction to RDF & SPARQLIntroduction to RDF & SPARQL
Introduction to RDF & SPARQL
Open Data Support
 

Was ist angesagt? (20)

Big Linked Data - Creating Training Curricula
Big Linked Data - Creating Training CurriculaBig Linked Data - Creating Training Curricula
Big Linked Data - Creating Training Curricula
 
Sören Auer | Enterprise Knowledge Graphs
Sören Auer | Enterprise Knowledge GraphsSören Auer | Enterprise Knowledge Graphs
Sören Auer | Enterprise Knowledge Graphs
 
euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)
 
ESWC 2017 Tutorial Knowledge Graphs
ESWC 2017 Tutorial Knowledge GraphsESWC 2017 Tutorial Knowledge Graphs
ESWC 2017 Tutorial Knowledge Graphs
 
Ephedra: efficiently combining RDF data and services using SPARQL federation
Ephedra: efficiently combining RDF data and services using SPARQL federationEphedra: efficiently combining RDF data and services using SPARQL federation
Ephedra: efficiently combining RDF data and services using SPARQL federation
 
How to describe a dataset. Interoperability issues
How to describe a dataset. Interoperability issuesHow to describe a dataset. Interoperability issues
How to describe a dataset. Interoperability issues
 
Standardizing for Open Data
Standardizing for Open DataStandardizing for Open Data
Standardizing for Open Data
 
Discovering Related Data Sources in Data Portals
Discovering Related Data Sources in Data PortalsDiscovering Related Data Sources in Data Portals
Discovering Related Data Sources in Data Portals
 
The Information Workbench - Linked Data and Semantic Wikis in the Enterprise
The Information Workbench - Linked Data and Semantic Wikis in the EnterpriseThe Information Workbench - Linked Data and Semantic Wikis in the Enterprise
The Information Workbench - Linked Data and Semantic Wikis in the Enterprise
 
Metadata Mapping & Crosswalks
Metadata Mapping & CrosswalksMetadata Mapping & Crosswalks
Metadata Mapping & Crosswalks
 
Smart Data Applications powered by the Wikidata Knowledge Graph
Smart Data Applications powered by the Wikidata Knowledge GraphSmart Data Applications powered by the Wikidata Knowledge Graph
Smart Data Applications powered by the Wikidata Knowledge Graph
 
How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?
 
The CIARD RINGValeri
The CIARD RINGValeriThe CIARD RINGValeri
The CIARD RINGValeri
 
Putting Historical Data in Context: how to use DSpace-GLAM
Putting Historical Data in Context: how to use DSpace-GLAMPutting Historical Data in Context: how to use DSpace-GLAM
Putting Historical Data in Context: how to use DSpace-GLAM
 
The RDF Report Card: Beyond the Triple Count
The RDF Report Card: Beyond the Triple CountThe RDF Report Card: Beyond the Triple Count
The RDF Report Card: Beyond the Triple Count
 
AlphaPy: A Data Science Pipeline in Python
AlphaPy: A Data Science Pipeline in PythonAlphaPy: A Data Science Pipeline in Python
AlphaPy: A Data Science Pipeline in Python
 
Introduction to RDF & SPARQL
Introduction to RDF & SPARQLIntroduction to RDF & SPARQL
Introduction to RDF & SPARQL
 
Fighting COVID-19 with Artificial Intelligence
Fighting COVID-19 with Artificial IntelligenceFighting COVID-19 with Artificial Intelligence
Fighting COVID-19 with Artificial Intelligence
 
LD4KD 2015 - Demos and tools
LD4KD 2015 - Demos and toolsLD4KD 2015 - Demos and tools
LD4KD 2015 - Demos and tools
 
Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...
Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...
Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...
 

Ähnlich wie Data Analysis in Dataverse & Visualization of Datasets on Historical Maps by Vyacheslav Tykhonov, Richard Zijdeman, and Jerry de Vries

Team 05 linked data generation
Team 05 linked data generationTeam 05 linked data generation
Team 05 linked data generation
plan4all
 
Tag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh PlatformTag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh Platform
Sanjay Padhi, Ph.D
 

Ähnlich wie Data Analysis in Dataverse & Visualization of Datasets on Historical Maps by Vyacheslav Tykhonov, Richard Zijdeman, and Jerry de Vries (20)

Clio infra Collabs data analysis tools
Clio infra Collabs data analysis toolsClio infra Collabs data analysis tools
Clio infra Collabs data analysis tools
 
The recovery of netherlands geographic information system (nlgis 2)
The recovery of netherlands geographic information system (nlgis 2)The recovery of netherlands geographic information system (nlgis 2)
The recovery of netherlands geographic information system (nlgis 2)
 
TEAMS 6, 7 and 8
TEAMS 6, 7 and 8TEAMS 6, 7 and 8
TEAMS 6, 7 and 8
 
Geohosting
GeohostingGeohosting
Geohosting
 
ENGAGE Workshop at OpenDataWeek2013
ENGAGE Workshop at OpenDataWeek2013ENGAGE Workshop at OpenDataWeek2013
ENGAGE Workshop at OpenDataWeek2013
 
Gsoc proposal
Gsoc proposalGsoc proposal
Gsoc proposal
 
Inspire hack 2017-linked-data
Inspire hack 2017-linked-dataInspire hack 2017-linked-data
Inspire hack 2017-linked-data
 
Team 05 linked data generation
Team 05 linked data generationTeam 05 linked data generation
Team 05 linked data generation
 
Gsoc proposal 2021 polaris
Gsoc proposal 2021 polarisGsoc proposal 2021 polaris
Gsoc proposal 2021 polaris
 
EUDAT data architecture and interoperability aspects – Daan Broeder
EUDAT data architecture and interoperability aspects – Daan BroederEUDAT data architecture and interoperability aspects – Daan Broeder
EUDAT data architecture and interoperability aspects – Daan Broeder
 
A Comprehensive Guide to Data Science Technologies.pdf
A Comprehensive Guide to Data Science Technologies.pdfA Comprehensive Guide to Data Science Technologies.pdf
A Comprehensive Guide to Data Science Technologies.pdf
 
Executable papers
Executable papersExecutable papers
Executable papers
 
Towards a Community-driven Data Science Body of Knowledge – Data Management S...
Towards a Community-driven Data Science Body of Knowledge – Data Management S...Towards a Community-driven Data Science Body of Knowledge – Data Management S...
Towards a Community-driven Data Science Body of Knowledge – Data Management S...
 
Geospatial metadata and spatial data workshop: 19 June 2014
Geospatial metadata and spatial data workshop: 19 June 2014Geospatial metadata and spatial data workshop: 19 June 2014
Geospatial metadata and spatial data workshop: 19 June 2014
 
General concepts: DDI
General concepts: DDIGeneral concepts: DDI
General concepts: DDI
 
Tag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh PlatformTag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh Platform
 
DSpace-CRIS: a CRIS enhanced repository platform
DSpace-CRIS: a CRIS enhanced repository platformDSpace-CRIS: a CRIS enhanced repository platform
DSpace-CRIS: a CRIS enhanced repository platform
 
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
 
Dataverse opportunities
Dataverse opportunitiesDataverse opportunities
Dataverse opportunities
 
Elise Smith & Chris Wild - Public Participatory GIS
Elise Smith & Chris Wild - Public Participatory GISElise Smith & Chris Wild - Public Participatory GIS
Elise Smith & Chris Wild - Public Participatory GIS
 

Mehr von datascienceiqss

American Journal of Political Science & The Odum Institute: Promoting Researc...
American Journal of Political Science & The Odum Institute: Promoting Researc...American Journal of Political Science & The Odum Institute: Promoting Researc...
American Journal of Political Science & The Odum Institute: Promoting Researc...
datascienceiqss
 

Mehr von datascienceiqss (20)

Citing Data in Journal Articles using JATS by Deborah A. Lapeyre
Citing Data in Journal Articles using JATS by Deborah A. LapeyreCiting Data in Journal Articles using JATS by Deborah A. Lapeyre
Citing Data in Journal Articles using JATS by Deborah A. Lapeyre
 
Big Data Repository for Structural Biology: Challenges and Opportunities by P...
Big Data Repository for Structural Biology: Challenges and Opportunities by P...Big Data Repository for Structural Biology: Challenges and Opportunities by P...
Big Data Repository for Structural Biology: Challenges and Opportunities by P...
 
iRODS/Dataverse Project by Jonathan Crabtree
iRODS/Dataverse Project by Jonathan CrabtreeiRODS/Dataverse Project by Jonathan Crabtree
iRODS/Dataverse Project by Jonathan Crabtree
 
DataTags: Sharing Privacy Sensitive Data by Michael Bar-sinai
DataTags: Sharing Privacy Sensitive Data by Michael Bar-sinaiDataTags: Sharing Privacy Sensitive Data by Michael Bar-sinai
DataTags: Sharing Privacy Sensitive Data by Michael Bar-sinai
 
DataTags: Sharing Privacy Sensitive Data by Latanya Sweeney
DataTags: Sharing Privacy Sensitive Data by Latanya SweeneyDataTags: Sharing Privacy Sensitive Data by Latanya Sweeney
DataTags: Sharing Privacy Sensitive Data by Latanya Sweeney
 
Center for Open Science and the Open Science Framework: Dataverse Add-on by S...
Center for Open Science and the Open Science Framework: Dataverse Add-on by S...Center for Open Science and the Open Science Framework: Dataverse Add-on by S...
Center for Open Science and the Open Science Framework: Dataverse Add-on by S...
 
Geospatial Data Visualization: WorldMap Integration by Raman Prasad
Geospatial Data Visualization: WorldMap Integration by Raman PrasadGeospatial Data Visualization: WorldMap Integration by Raman Prasad
Geospatial Data Visualization: WorldMap Integration by Raman Prasad
 
Sharing Data Through Plots with Plotly by Alex Johnson
Sharing Data Through Plots with Plotly by Alex JohnsonSharing Data Through Plots with Plotly by Alex Johnson
Sharing Data Through Plots with Plotly by Alex Johnson
 
TwoRavens: A Graphical, Browser-Based Statistical Interface for Data Reposito...
TwoRavens: A Graphical, Browser-Based Statistical Interface for Data Reposito...TwoRavens: A Graphical, Browser-Based Statistical Interface for Data Reposito...
TwoRavens: A Graphical, Browser-Based Statistical Interface for Data Reposito...
 
MIT Libraries Dataverse by Katherine McNeill
MIT Libraries Dataverse by Katherine McNeillMIT Libraries Dataverse by Katherine McNeill
MIT Libraries Dataverse by Katherine McNeill
 
The Project TIER Dataverse: Archiving and Sharing Replicable Student Research...
The Project TIER Dataverse: Archiving and Sharing Replicable Student Research...The Project TIER Dataverse: Archiving and Sharing Replicable Student Research...
The Project TIER Dataverse: Archiving and Sharing Replicable Student Research...
 
Dataverse in China: Internationalization, Curation and Promotion by Yin Shenqin
Dataverse in China: Internationalization, Curation and Promotion by Yin ShenqinDataverse in China: Internationalization, Curation and Promotion by Yin Shenqin
Dataverse in China: Internationalization, Curation and Promotion by Yin Shenqin
 
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
 
Metadata & Data Curation Services by Thu-Mai Christian
Metadata & Data Curation Services by Thu-Mai ChristianMetadata & Data Curation Services by Thu-Mai Christian
Metadata & Data Curation Services by Thu-Mai Christian
 
American Journal of Political Science & The Odum Institute: Promoting Researc...
American Journal of Political Science & The Odum Institute: Promoting Researc...American Journal of Political Science & The Odum Institute: Promoting Researc...
American Journal of Political Science & The Odum Institute: Promoting Researc...
 
Political Analysis Dataverse by Jonathan N. Katz
Political Analysis Dataverse by Jonathan N. KatzPolitical Analysis Dataverse by Jonathan N. Katz
Political Analysis Dataverse by Jonathan N. Katz
 
Data in Brief and Dataverse: Incentivizing Authors to Share Data by Paige Sha...
Data in Brief and Dataverse: Incentivizing Authors to Share Data by Paige Sha...Data in Brief and Dataverse: Incentivizing Authors to Share Data by Paige Sha...
Data in Brief and Dataverse: Incentivizing Authors to Share Data by Paige Sha...
 
Dataverse in the Universe of Data by Christine L. Borgman
Dataverse in the Universe of Data by Christine L. BorgmanDataverse in the Universe of Data by Christine L. Borgman
Dataverse in the Universe of Data by Christine L. Borgman
 
Data Publishing Models by Sünje Dallmeier-Tiessen
Data Publishing Models by Sünje Dallmeier-TiessenData Publishing Models by Sünje Dallmeier-Tiessen
Data Publishing Models by Sünje Dallmeier-Tiessen
 
Persistent Identifier Services and their Metadata by John Kunze
Persistent Identifier Services and their Metadata by John KunzePersistent Identifier Services and their Metadata by John Kunze
Persistent Identifier Services and their Metadata by John Kunze
 

Kürzlich hochgeladen

Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
KarakKing
 

Kürzlich hochgeladen (20)

SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 

Data Analysis in Dataverse & Visualization of Datasets on Historical Maps by Vyacheslav Tykhonov, Richard Zijdeman, and Jerry de Vries

  • 1. CLIO INFRA Data analysis in Dataverse & visualization of datasets on historical maps Dataverse Community meeting 2015, June 11, Harvard University Vyacheslav Tykhonov Richard Zijdeman Jerry de Vries International Institute of Social History
  • 2. Introduction The International Institute of Social History (IISH) collects information in the field of social history and makes it available to the public. The IISH is one of the major scientific heritage institutions in the Netherlands. In the field of socio-economic historical research IISH plays a prominent role. CLIO INFRA is Digital Research Infrastructure for the Arts and Humanities developed to integrate a number of databases (hubs) consisting of data on global social, economic and institutional indicators over the past five centuries, with special attention for the past 200 years. Clio Infra developed by IISH.
  • 3. Our mission Clio Infra is the bridge between Data Collections and Research Datasets. “Old school” researchers time: Collecting datasets in CSV, Access or Excel and other formats mostly on his own without real collaboration between all researchers in the same field. Very closed model. Digital Humanities era: Young researchers use various computer tools to collect, analyze and combine research datasets they’ve got from older generation of researchers. They can contribute and help each other and build communities. The future: The goal is to build tools to get datasets from “old school” researchers, standardize and store data in the Clouds in order provide access to Open Data for researchers all over the world and made possible collaboration between them.
  • 4. Clio Infra Collaboration platform Clio Infra functionality based on the Dataverse solution: - teams collaboratively can curate, share and analyze research datasets - teams members can share the responsibility to collect data on specific variables (for example, countries) and inform each other about changes and additions - dataset version control system is able to track changes in datasets - other researchers can download their own copy of the data if dataset is published as Open Data Dataverse is flexible metadata store (repository) that connected with Research datasets storage by our Data Processing Engine (DPE)
  • 5. Added Value for future Researchers The benefits of data sharing can be classified in terms of Metadata and Data access and sharing (Collection tools) and Statistical Analysis and Data Mining (Research tools): ● access to a specific case study, citing and finding data ● access to the universe of data from Dataverse network that can organize and display them for browsing and searching ● data filtering: researchers with proper authorization can obtain the subset of data provided by data collector ● data analysis to run descriptive statistics and graphics, visualization, plotting on historical maps ● Data APIs to export data for further analysis by popular statistical packages (STATA, SPSS, R, iPython Notebook) and advanced data mining tools that will be developed in the future (always up-to-date solution)
  • 6. Collaboration possibilities Descriptive metadata: - dataset file - documentation - standardized tables with codes can be part of dataset Sharing and collaboration capabilities: - requesting unique API token by every researcher - user of Dataverse - exchanging API token between researchers and granting permissions to work on the same datasets as a team using data analysis tools
  • 7. Collaboration Data Workflow - Draft Datasets are visible only for owners - with API tokens researchers can get access to interactive dashboard to get some insights about the data stored in the Dataverse - every dataset converted to dataframe - dashboard can provide access to all variables from the dataframe and visualize them on charts, graphs, historical maps and treemaps After dataset is prepared to go public, it can be published in Collabs: - guest users can download the copy of dataset - team members with permissions and authorized API token can contribute to dataset
  • 8. Data Analysis in Clio Infra ● Data Processing Engine has python core and developed for nlgis.nl (Netherlands Geographic Information System) and distributed as widget ● interactive data exploration dashboard based on D3 library ● every dataset from Dataverse is available as Data API (json) and can be connected to any statistical package ● filtering the data on specific years and variables based on dataframes (pandas, python for data analysis) ● data quality check is the quick visual tool to apply Benford’s law for all values from specific dataset ● dataframes can be open by researchers in various statistical packages like iPython notebook, R Studio, SPSS, STATA, Mathlab, etc
  • 9. Data Processing Engine (DPE) specification ● can split values from any dataset in number of categories specified by researcher (8 by default) ● algorithm to categorize data values in proper categories can be selected manually (percentile by default) ● can define maximum possible categories for specific dataset if there is no way to get categories number specified by user of the system (for example, if there are 2-3 categories of data values) ● data ranges should be defined to get possibility to visualize data on some chart or map in the right scale ● colors can be specified by user (Color Brewing, see http://colorbrewer2.org) ● legend generated and attached to all visualizations automatically ● values with missing data shown as 'no data' regions on map ● all data values delivered by Data API to make the data analysis platform independent and communicate with other systems or statistical packages
  • 10. API Service (Data API) Data API provided by Data Processing Engine is the most important functionality for the well equipped digital infrastructure: ● easy way to analyze data in popular statistical packages (STATA, SPSS, Excel) ● use common data science programming languages like Python, R to perform more advanced research using external Data Science libraries ● analyze data with toolboxes like Wolfram|Alpha and other Discovery Platforms (added value for the future) ● suitable for other researchers and developers to use advanced technique and data mining tools that aren’t developed yet
  • 11. Example of output from Data API ● every dataset ingested by DPE available as Data API with unique handler ● API can be filtered by variables extracted from the content of data file Example: /api/data?&handle=F16UDU:30:31&countrycode=USA&year=1880&categories=8&datarange=calculate "United States of America": { "code": "F16UDU30_31", "color": "#FF7F00", "countrycode": "USA", "id": "2085", "indicator": "Total Urban Population", "intcode": "840", "r": 923.99, "region": "W. Offshoots", "units": "x 1000", "value": 14264.0, "year": 1880 } }
  • 12. Data visualization and plotting data on historical maps ● Data Processing Engine (DPE) is the core of data visualization process and connected to geoservice ● data attributes like scales and colors calculated by DPE on the fly based on the input of researcher (for example, number of categories to split data) and the part of Data API ● histograms, cross tabulations, enhanced descriptive statistics based on pandas dataframes ● visualization of datasets on historical maps will be available to plot data on maps for last 500 years
  • 13. Historical maps services: Geoservice and Geocoder Delivered by Webmapper.nl and integrated with common Clio Infra infrastructure. Geoservice basic requirements: - should provide actual GeoAPI with historical polygons for all countries and regions based on standardized codes - available as geojson/topojson for online visualization - QGIS toolbox should be supported to upload new maps and update old maps as shape files Geocoder should be able to standardize all geographical locations in different datasets: - USSR and Soviet Union should be recognized as one country with the same PID - Germany before and after 1990 - Indonesia before and after 1999
  • 14. GeoAPI example Geoservice can provide polygons for specific years on the national or world level rendered as topojson or geojson. GeoAPI: /api/maps?world=on&year=1962 Polygons for all countries will be delivered as topojson: arcs":[[1782,2186]]}]}},"arcs":[[[8387,6231],[0,5],[1,1],[1,-1],[2,0],[2,-1],[3,-4],[1, -3],[0,-1],[-1,-5],[0,-1],[-1,2],[0,1],[-2,2],[-3,1],[-1,0],[-1,0],[-1,3],[0,1]], [[8390,6247],[1,1],[0,1],[2,1],[1,0],[1,-2],[-1,-5],[-1,0],[-1,1],[-1,1],[-1,1],[0,1]], [[8391,6204],[0,2],[-1,1],[-1,-1],[0,1],[0,1],[1,3],[1,0],[2,-6],[0,-1],[0,-1],[0,-1], [-1,1],[-1,1]],[[8364,6093],[0,2],[2,5],[1,0],[1,-2],[-1,-6],[-1,-3],[-1,0],[-1,0],[0,1], [0,3]],[[5941,6575],[0,-1],[-1,0],[-1,1],[0,1],[-1,0],[-1,0],[-1,0],[-1,0],[0,-1],[-1,-1], [0,-2],[0,-1],[0,-2],[0,-1],[-1,-3],[-3,-2],[-4,-4],[-1,-1],[-1,0],[-2,-1],[-2,0],[-1,0],[-1, -1],[-1,-2],[-5,1],[-1,0],[-1,-1],[-1,-1],[-1,0],[-2,1],[-4,3],[-1,1],[-1,2],[-2,8],[-1,4], [-1,9],[0,1],[0,2],[0,1],[1,0],[0,-1],[0,-1],[1,-1],[0,-1],[1,0]
  • 15. Integration of Dataverse, DPE, geoservice and geocoder in 6 steps 1. Data API communicates with geocoder to find standardized codes for all locations in dataset 2. the same geocode with associated polygons is coming from geoservice 3. data and geoservice matching in the frontend by any geospatial javascript library 4. attributes like colors and scales filling the map polygons 5. legend with scales, values and colors coming from DPE to make map look complete 6. notes, sources and other metadata extracting from dataverse API to provide clear explanation of generated historical map Live demo for Labor Conflicts in 100 years
  • 16. Dataset Quality Check Benford’s law to do test of the quality of data Linked Edit Rules: a methodology to publish, link, combine and execute edit rules on the Web as Linked Data to verify consistency of statistical datasets and recognize wrong filled data values (for example, characters in values where numbers expected) Chart visualization to get overview of missing data values
  • 17. Benford’s law in action on real dataset
  • 18. Datasets combining and aggregation in data exploration tools ● tools that can predict the same variables disambiguated in different datasets (Year and Jaar) ● geocoding services can standardize different regions (Netherlands → NL, Amsterdam → AMS) ● all possible relationship paths can be ranked from the “best” (100%) to the “worst” (0%), for example, value “05” for variable “Month” can be recognized as “May” ● standardized datasets can be used as “reference” data for other datasets from other researcher groups and depot services (CHIA, Harvard DataVerse Network, MIT, DANS)
  • 19. Clio Infra Infrastructure is suitable for different kinds of datasets Quantitative datasets that store quantity data values (numbers): - data measured (length, speed, height, age) - example: current clio-infra datasets Qualitative datasets store quality observations (descriptions): - data can be observed (colors, textures, professions, groups) - example: HISCO, world strikes dataset Data Visualization is different for different kinds of datasets: - quantity can be plotted on charts, graphs, historical maps - quality (hierarchy) can be visualized on treemaps and maps
  • 20. Example: Interactive data exploration based on API token
  • 21. Summary: data exploration and analysis will extend Dataverse functionality ● data quality check during upload/update of dataset ● automatic ingestion and recognition of years and locations in datasets ● integrated with geocoder and geoservice to get polygons for maps ● interactive dashboard to do visual exploration of variables from dataset (graph / histogram / scatterplot) ● data processing engine to plot data on interactive historical maps (if dataset has geospatial data) ● correlation for continuous variables ● regression analysis for estimating the relationships among variables ● building treemaps for qualitative data analysis (hierarchical data coming soon)