Over the past decade, vast amounts of machine-readable structured information have become available through the automation of research processes as well as the increasing popularity of knowledge graphs and semantic technologies.
Today, we count more than 10,000 datasets made available online following Semantic Web standards.
A major and yet unsolved challenge that research faces today is to perform scalable analysis of large-scale knowledge graphs in order to facilitate applications in various domains including life sciences, publishing, and the internet of things.
The main objective of this thesis is to lay foundations for efficient algorithms performing analytics, i.e. exploration, quality assessment, and querying over semantic knowledge graphs at a scale that has not been possible before.
First, we propose a novel approach for statistical calculations of large RDF datasets, which scales out to clusters of machines.
In particular, we describe the first distributed in-memory approach for computing 32 different statistical criteria for RDF datasets using Apache Spark.
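To make the flavour of such criteria concrete, here is a minimal, self-contained sketch of one of them, property usage counts, written against Spark's RDD API. The file path, the naive N-Triples parser, and the `Triple` case class are illustrative assumptions; SANSA itself builds on Apache Jena and exposes a richer API.

```scala
import org.apache.spark.sql.SparkSession

// Simplified triple representation for this sketch (SANSA uses Jena's Triple).
case class Triple(s: String, p: String, o: String)

val spark = SparkSession.builder().appName("RDFStatsSketch").getOrCreate()

// Naively parse an N-Triples file: subject, predicate, and the rest as object.
val triples = spark.sparkContext
  .textFile("hdfs:///data/dataset.nt") // assumed input location
  .map(_.trim)
  .filter(line => line.nonEmpty && !line.startsWith("#"))
  .map { line =>
    val Array(s, p, rest) = line.split("\\s+", 3)
    Triple(s, p, rest.stripSuffix(" ."))
  }

// One statistical criterion: how often each property occurs in the dataset.
val propertyUsage = triples
  .map(t => (t.p, 1L))
  .reduceByKey(_ + _)

propertyUsage.sortBy(-_._2).take(10).foreach(println)
```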
Many applications, such as data integration, search, and interlinking, can take full advantage of the data only when a priori statistical information about its internal structure and coverage is available.
However, such applications may suffer from poor data quality and fail to exploit the data fully when its size exceeds the capacity of the available resources.
Thus, we introduce a distributed approach to the quality assessment of large RDF datasets.
It is the first distributed, in-memory approach for computing different quality metrics for large RDF datasets using Apache Spark. We also provide a quality assessment pattern that can be used to generate new scalable metrics that can be applied to big data.
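As an illustration of the filter-and-aggregate shape that such metrics share, the sketch below computes one plausible metric, the fraction of distinct subjects that carry an rdf:type statement, over the `Triple` RDD from the previous sketch. It mirrors the general pattern rather than SANSA's actual metric API.

```scala
import org.apache.spark.rdd.RDD

// One quality metric in the filter-and-aggregate style: "type completeness",
// the fraction of distinct subjects that have at least one rdf:type statement.
def typeCompleteness(triples: RDD[Triple]): Double = {
  val rdfType = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
  val subjects      = triples.map(_.s).distinct().count()
  val typedSubjects = triples.filter(_.p == rdfType).map(_.s).distinct().count()
  if (subjects == 0L) 0.0 else typedSubjects.toDouble / subjects
}

println(f"Type completeness: ${typeCompleteness(triples)}%.3f")
```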
Based on knowledge of a dataset's internal statistics and its quality, users typically want to query and retrieve large amounts of information.
However, processing such large RDF datasets efficiently has become difficult.
Indeed, doing so requires both efficient storage strategies and query-processing engines that can scale with the size of the data.
Therefore, we propose a scalable approach for evaluating SPARQL queries over distributed RDF datasets by translating SPARQL queries into executable Spark code.
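The core idea can be sketched as follows: a SPARQL basic graph pattern with a shared variable becomes a self-join over a triples table. The table layout, the names, and the hand-written SQL below are assumptions for illustration; the thesis's actual translation is generated automatically and is more involved.

```scala
// Register the triples from the earlier sketch as a table with columns s, p, o.
import spark.implicits._
triples.toDF().createOrReplaceTempView("triples")

// SPARQL (conceptually):
//   SELECT ?person ?name
//   WHERE { ?person a foaf:Person . ?person foaf:name ?name }
// Each triple pattern scans the table; the shared ?person becomes a join key.
val people = spark.sql("""
  SELECT t1.s AS person, t2.o AS name
  FROM triples t1
  JOIN triples t2 ON t1.s = t2.s
  WHERE t1.p = '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>'
    AND t1.o = '<http://xmlns.com/foaf/0.1/Person>'
    AND t2.p = '<http://xmlns.com/foaf/0.1/name>'
""")
people.show()
```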
We conducted several empirical evaluations to assess the scalability, effectiveness, and efficiency of our proposed approaches.
More importantly, various use cases, i.e. Ethereum analysis, Mining Big Data Logs, and Scalable Integration of POIs, have been developed that leverage our approaches.
The empirical evaluations and concrete applications provide evidence that our methodology and techniques proposed during this thesis help to effectively analyze and process large-scale RDF datasets.
All of the approaches proposed in this thesis are integrated into the larger SANSA framework.
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S... (Gezim Sejdiu)
Over the past decade, vast amounts of machine-readable structured information have become available through the automation of research processes as well as the increasing popularity of knowledge graphs and semantic technologies.
A major and yet unsolved challenge that research faces today is to perform scalable analysis of large scale knowledge graphs in order to facilitate applications like link prediction, knowledge base completion, and question answering.
Most machine learning approaches that scale horizontally (i.e., can be executed in a distributed environment) work on simpler feature-vector-based input rather than on more expressive knowledge structures.
On the other hand, the learning methods which exploit the expressive structures, e.g. Statistical Relational Learning and Inductive Logic Programming approaches, usually do not scale well to very large knowledge bases owing to their working complexity.
This talk gives an overview of the ongoing project Semantic Analytics Stack (SANSA), which aims to bridge this research gap by creating an out-of-the-box library for scalable, in-memory structured learning.
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ... (MLconf)
Building a Recommender System for Publications using Vector Space Model and Python: In recent years, it has become very common that we have access to a large number of publications on similar or related topics. Recommendation systems for publications are needed to locate appropriate published articles from a large number of publications on the same or similar topics. In this talk, I will describe a recommender system framework for PubMed articles. PubMed is a free search engine that primarily accesses the MEDLINE database of references and abstracts on life-sciences and biomedical topics. The proposed recommender system produces two types of recommendations: (i) content-based recommendations and (ii) recommendations based on similarities with other users' search profiles. The first type, content-based recommendation, can efficiently search for material that is similar in context or topic to the input publication. The second mechanism generates recommendations using the search history of users whose search profiles match the current user. The content-based recommendation system uses a Vector Space Model to rank PubMed articles based on the similarity of content items. To implement the second recommendation mechanism, we use Python libraries and frameworks: we find the profile similarity of users and recommend additional publications based on the history of the most similar user. In the talk I will present the background and motivation for these recommendation systems, and discuss the implementation of this PubMed recommendation system with examples.
This talk will cover, via live demo & code walk-through, the key lessons we’ve learned while building such real-world software systems over the past few years. We’ll incrementally build a hybrid machine learned model for fraud detection, combining features from natural language processing, topic modeling, time series analysis, link analysis, heuristic rules & anomaly detection. We’ll be looking for fraud signals in public email datasets, using Python & popular open-source libraries for data science and Apache Spark as the compute engine for scalable parallel processing.
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC (Geoffrey Fox)
This proposes an integration of HPC and Apache technologies. HPC-ABDS+ integration areas include:
File systems
Cluster resource management
File and object data management
Inter-process and thread communication
Analytics libraries
Workflow
Monitoring
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE... (Journal For Research)
The document discusses the performance evaluation of social network analysis algorithms using Apache Spark. It analyzes the performance of algorithms such as PageRank, connected components, triangle counting, and K-means clustering on different social network datasets. The results show that GraphX PageRank runs faster than the naive Spark implementation. Connected-components execution time grows super-linearly at first and then fluctuates. Triangle-counting time grows linearly with graph size. K-means clustering is tested using both naive and MLlib implementations in Spark.
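For reference, invoking the built-in GraphX PageRank that the summary compares against takes only a few lines; the edge-list path and the tolerance below are assumed placeholders, not values from the document.

```scala
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PageRankSketch").getOrCreate()

// Assumed input: one "srcId dstId" pair per line, e.g. a social-network edge list.
val graph = GraphLoader.edgeListFile(spark.sparkContext, "hdfs:///data/edges.txt")

// Built-in GraphX PageRank; iterates until rank changes fall below the tolerance.
val ranks = graph.pageRank(tol = 0.0001).vertices
ranks.sortBy(-_._2).take(10).foreach(println)
```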
This summary provides an overview of the SparkR package, which provides an R frontend for the Apache Spark distributed computing framework:
- SparkR enables large-scale data analysis from the R shell by using Spark's distributed computation engine to parallelize and optimize R programs. It allows R users to leverage Spark's libraries, data sources, and optimizations while programming in R.
- The central component of SparkR is the distributed DataFrame, which provides a familiar data frame interface to R users but can handle large datasets using Spark. DataFrame operations are optimized using Spark's query optimizer.
- SparkR's architecture includes an R to JVM binding that allows R programs to submit jobs to Spark, and support for running R execut
Enterprise knowledge graphs use semantic technologies like RDF, RDF Schema, and OWL to represent knowledge as a graph consisting of concepts, classes, properties, relationships, and entity descriptions. They address the "variety" aspect of big data by facilitating integration of heterogeneous data sources using a common data model. Key benefits include providing background knowledge for various applications and enabling intra-organizational data sharing through semantic integration. Challenges include ensuring data quality, coherence, and managing updates across the knowledge graph.
The document describes the R User Conference 2014 which was held from June 30 to July 3 at UCLA in Los Angeles. The conference included tutorials on the first day covering topics like applied predictive modeling in R and graphical models. Keynote speeches and sessions were held on subsequent days covering various technical and statistical topics as well as best practices in R programming. Tutorials and sessions demonstrated tools and packages in R like dplyr and Shiny for data analysis and interactive visualizations.
This document discusses tools for distributed data analysis including Apache Spark. It is divided into three parts:
1) An introduction to cluster computing architectures like batch processing and stream processing.
2) The Python data analysis library stack including NumPy, Matplotlib, Scikit-image, Scikit-learn, Rasterio, Fiona, Pandas, and Jupyter.
3) The Apache Spark cluster computing framework and examples of its use including contexts, HDFS, telemetry, MLlib, streaming, and deployment on AWS.
This document discusses image search and analysis techniques for remote sensing data. It describes an index management system that takes in data and indexes it using column-based databases. Images are analyzed to extract features that allow for image search based on compression in compressed streams. Queries can be performed on the indexed data to return similar images based on semantic labels and normalized distances from queries. Examples are provided using different remote sensing datasets, including GeoEye, DigitalGlobe, and TerraSAR-X images.
Big Graph: Tools, Techniques, Issues, Challenges and Future Directions (csandit)
Analyzing interconnection structures among the data through the use of graph algorithms and graph analytics has been shown to provide tremendous value in many application domains (like social networks, protein networks, transportation networks, bibliographical networks, knowledge bases and many more). Nowadays, graphs with billions of nodes and trillions of edges have become very common. In principle, graph analytics is an important big data discovery technique. Therefore, with the increasing abundance of large scale graphs, designing scalable systems for processing and analyzing large scale graphs has become one of the timeliest problems facing the big data research community. In general, distributed processing of big graphs is a challenging task due to their size and the inherent irregular structure of graph computations. In this paper, we present a comprehensive overview of the state-of-the-art to better understand the challenges of developing very high-scalable graph processing systems. In addition, we identify a set of the current open research challenges and discuss some promising directions for future research.
Classification of Big Data Use Cases by different Facets (Geoffrey Fox)
Ogres classify Big Data applications by multiple facets, each with several exemplars and features. This gives a guide to the breadth and depth of Big Data and allows one to examine which Ogres a particular architecture/software supports.
This document discusses the capabilities and performance of Virtuoso, an open-source database for managing and querying semantic data. It describes how Virtuoso uses techniques like column storage, vector execution, and structure awareness to achieve SQL and SPARQL query performance on par with specialized relational databases. The document also outlines several European Union-funded research projects aimed at further improving RDF database performance and scaling through benchmarks, geospatial extensions, and graph analytics.
Linking Open, Big Data Using Semantic Web Technologies - An Introduction (Ronald Ashri)
The Physics Department of the University of Cagliari and the Linkalab Group invited me to talk about the Semantic Web and Linked Data - this is simply an introduction to the technologies involved.
The document provides an introduction to Prof. Dr. Sören Auer and his background in knowledge graphs. It discusses his current role as a professor and director focusing on organizing research data using knowledge graphs. It also briefly outlines some of his past roles and major scientific contributions in the areas of technology platforms, funding acquisition, and strategic projects related to knowledge graphs.
What is the "Big Data" version of the Linpack Benchmark?; What is "Big Data... (Geoffrey Fox)
Advances in high-performance/parallel computing in the 1980s and '90s were spurred by the development of quality high-performance libraries, e.g., SCALAPACK, as well as by well-established benchmarks, such as Linpack.
Similar efforts to develop libraries for high-performance data analytics are underway. In this talk we argue that such benchmarks should be motivated by frequent patterns encountered in high-performance analytics, which we call Ogres.
Based upon earlier work, we propose that doing so will enable adequate coverage of the "Apache" big data stack as well as most common application requirements, whilst building upon parallel computing experience.
Given the spectrum of analytic requirements and applications, there are multiple "facets" that need to be covered, and thus we propose an initial set of benchmarks - by no means currently complete - that covers these characteristics.
We hope this will encourage debate.
We present a software model built on the Apache Big Data Stack (ABDS), widely used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We discuss layers in this stack
We give examples of integrating ABDS with HPC
We discuss how to implement this in a world of multiple infrastructures and evolving software environments for users, developers and administrators
We present Cloudmesh as supporting Software-Defined Distributed System as a Service or SDDSaaS with multiple services on multiple clouds/HPC systems.
We explain the functionality of Cloudmesh as well as the 3 administrator and 3 user modes supported
This document discusses interaction with linked data, focusing on visualization techniques. It begins with an overview of the linked data visualization process, including extracting data analytically, applying visualization transformations, and generating views. It then covers challenges like scalability, handling heterogeneous data, and enabling user interaction. Various visualization techniques are classified and examples are provided, including bar charts, graphs, timelines, and maps. Finally, linked data visualization tools and examples using tools like Sigma, Sindice, and Information Workbench are described.
The Bounties of Semantic Data Integration for the Enterprise (Ontotext)
Semantic data integration allows enterprises to connect heterogeneous data sources through a common language. This creates a unified 360-degree view of enterprise data and facilitates knowledge management and use. Semantic integration aims to enrich existing data with external knowledge and provide a single access point for enterprise assets. It addresses challenges of accessing and storing data from various internal resources by building a well-structured integrated whole to enhance business processes.
This presentation covers the whole spectrum of Linked Data production and exposure. After a grounding in the Linked Data principles and best practices, with special emphasis on the VoID vocabulary, we cover R2RML, operating on relational databases, Open Refine, operating on spreadsheets, and GATECloud, operating on natural language. Finally we describe the means to increase interlinkage between datasets, especially the use of tools like Silk.
Das Semantische Daten Web für Unternehmen [The Semantic Data Web for Enterprises] (Sören Auer)
This document summarizes the vision, technology, and applications of the Semantic Data Web for businesses. It discusses how the Semantic Web can help solve problems of searching for complex information across different data sources by complementing text on web pages with structured linked open data. It provides overviews of RDF standards, vocabularies, and technologies like SPARQL and OntoWiki that allow creating and managing structured knowledge bases. It also presents examples like DBpedia that extract structured data from Wikipedia and make it available on the web as linked open data.
This document discusses clustering of RDF data across the Semantic Web. It begins by describing the Linking Open Data project and the growing amount of RDF data available. It then discusses the motivations for clustering RDF data, such as improving data access and query response times over distributed machines. Current approaches to RDF clustering are also summarized, including extracting instance subgraphs and computing distances between instances. The document outlines different techniques for instance extraction and distance computation in RDF clustering.
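One widely used distance for that second step can be sketched directly in Spark: treat each instance's extracted subgraph as a set of (property, object) pairs and compute pairwise Jaccard distances. The `Triple` shape follows the earlier sketches, and the all-pairs `cartesian` is only viable for modest candidate sets; this is a generic illustration, not any specific system's code.

```scala
import org.apache.spark.rdd.RDD

// Pairwise Jaccard distance between instances, where each instance is
// described by the set of (property, object) pairs in its subgraph.
def jaccardDistances(triples: RDD[Triple]): RDD[((String, String), Double)] = {
  val features: RDD[(String, Set[(String, String)])] =
    triples.map(t => (t.s, Set((t.p, t.o)))).reduceByKey(_ ++ _)

  features.cartesian(features)
    .filter { case ((a, _), (b, _)) => a < b } // keep each unordered pair once
    .map { case ((a, fa), (b, fb)) =>
      val union        = (fa ++ fb).size.toDouble
      val intersection = fa.intersect(fb).size.toDouble
      ((a, b), if (union == 0.0) 0.0 else 1.0 - intersection / union)
    }
}
```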
The document discusses big data and linked data. It presents the three V's of big data - volume, velocity, and variety. It shows the semantic web layer cake and how linked data provides a lingua franca for data integration. It provides examples of using linked data for sensor data, supply chain data, and as a bridge between online and offline systems. Finally, it discusses adding a linked data layer to the existing internet architecture and engaging more stakeholders with the technology.
Presentation for CLARIAH IG Linked Open Data on the latest developments for Dataverse FAIR data repository. Building SEMAF workflow with external controlled vocabularies support and Semantic API.
This document discusses using Apache Hadoop and SQL Server to analyze large datasets. It finds that SQL Server struggles to efficiently query and analyze datasets with over 100 million rows, with query times increasing substantially with larger datasets. Apache Hadoop provides a more scalable solution by distributing data processing across a cluster. The document evaluates Hadoop and MongoDB for big data analysis, and chooses Hadoop for its ability to process large amounts of data for analytical purposes. It then discusses implementing Hortonworks Data Platform with Apache Ambari to analyze a 97GB population dataset using Hadoop.
Scalable Machine Learning: The Role of Stratified Data Sharding (inside-BigData.com)
In this deck from the 2019 Stanford HPC Conference, Srinivasan Parthasarathy from Ohio State University presents: Scalable Machine Learning: The Role of Stratified Data Sharding.
"With the increasing popularity of structured data stores, social networks and Web 2.0 and 3.0 applications, complex data formats, such as trees and graphs, are becoming ubiquitous. Managing and learning from such large and complex data stores, on modern computational eco-systems, to realize actionable information efficiently, is daunting. In this talk I will begin with discussing some of these challenges. Subsequently I will discuss a critical element at the heart of this challenge relates to the sharding, placement, storage and access of such tera- and peta- scale data. In this work we develop a novel distributed framework to ease the burden on the programmer and propose an agile and intelligent placement service layer as a flexible yet unified means to address this challenge. Central to our framework is the notion of stratification which seeks to initially group structurally (or semantically) similar entities into strata. Subsequently strata are partitioned within this eco-system according to the needs of the application to maximize locality, balance load, minimize data skew or even take into account energy consumption. Results on several real-world applications validate the efficacy and efficiency of our approach. (Notes: Joint work with Y. Wang (Airbnb) and A. Chakrabarti (MSR))."
Srinivasan Parthasarathy, Professor of Computer Science & Engineering, The Ohio State University
Srinivasan Parthasarathy is a Professor of Computer Science and Engineering and the director of the data mining research laboratory at Ohio State. His research interests span databases, data mining and high performance computing. He is among a handful of researchers nationwide to have won both the Department of Energy and National Science Foundation Career awards. He and his students have won multiple best paper awards or "best of" nominations from leading forums in the field including: SIAM Data Mining, ACM SIGKDD, VLDB, ISMB, WWW, ICDM, and ACM Bioinformatics. He chairs the SIAM data mining conference steering committee and serves on the action board of ACM TKDD and ACM DMKD --leading journals in the field. Since 2012 he also helped lead the creation of OSU's first-of-a-kind nationwide (USA) undergraduate major in data analytics and serves as one of its founding directors.
Watch the video: https://youtu.be/hOJI8e0p-UI
Learn more: http://web.cse.ohio-state.edu/~parthasarathy.2/
and
http://hpcadvisorycouncil.com/events/2019/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
DATABASE SYSTEMS PERFORMANCE EVALUATION FOR IOT APPLICATIONS (ijdms)
ABSTRACT
The amount of data stored in IoT databases increases as IoT applications extend throughout smart-city appliances, industry, and agriculture. Contemporary database systems must process huge amounts of sensory and actuator data in real time or interactively. Facing this first wave of the IoT revolution, database vendors struggle day by day to gain more market share, develop new capabilities, and overcome the disadvantages of previous releases, all while providing features for the IoT.
There are two popular database types: Relational Database Management Systems and NoSQL databases, with NoSQL gaining ground for IoT data storage. In this paper these two types are examined. Focusing on open-source databases, the authors experiment on IoT data sets and answer the question of which performs better, presenting results for the NoSQL database MongoDB and the SQL databases MySQL and PostgreSQL.
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION (ijdms)
Distributed databases and data replication are effective ways to increase the accessibility and reliability of un-structured, semi-structured and structured data to extract new knowledge. Replications offer better performance and greater availability of data. With the advent of Big Data, new storage and processing challenges are emerging.
To meet these challenges, Hadoop and DHTs compete in the storage domain and MapReduce and others in distributed processing, with their strengths and weaknesses.
We propose an analysis of the circular and radial replication mechanisms of the CLOAK DHT. We evaluate their performance through a comparative study of data from simulations. The results show that radial replication is better in storage, unlike circular replication, which gives better search results.
Alexander Aldev - Co-founder and CTO of MammothDB, currently focused on the architecture of the distributed database engine. Notable achievements in the past include managing the launch of the first triple-play cable service in Bulgaria and designing the architecture and interfaces from legacy systems of DHL Global Forwarding's data warehouse. Has lectured on Hadoop at AUBG and MTel.
"The future of Big Data tooling" will briefly review the architectural concepts of current Big Data tools like Hadoop and Spark. It will make the argument, from the perspective of both technology and economics, that the future of Big Data tools is in optimizing local storage and compute efficiency.
This document summarizes a seminar presentation on big data analytics. It reviews 25 research papers published between 2011-2014 on issues related to big data analysis, real-time big data analysis using Hadoop in cloud computing, and classification of big data using tools and frameworks. The review process involved a 5-stage analysis of the papers. Key issues identified include big data analysis, real-time analysis using Hadoop in clouds, and classification using tools like Hadoop, MapReduce, HDFS. Promising solutions discussed are MapReduce Agent Mobility framework, PuntStore with pLSM index, IOT-StatisticDB statistical database mechanism, and visual clustering analysis.
A survey on data mining and analysis in hadoop and mongo db (Alexander Decker)
This document discusses data mining of big data using Hadoop and MongoDB. It provides an overview of Hadoop and MongoDB and their uses in big data analysis. Specifically, it proposes using Hadoop for distributed processing and MongoDB for data storage and input. The document reviews several related works that discuss big data analysis using these tools, as well as their capabilities for scalable data storage and mining. It aims to improve computational time and fault tolerance for big data analysis by mining data stored in Hadoop using MongoDB and MapReduce.
Big Data Storage System Based on a Distributed Hash Tables System (ijdms)
Big Data is unavoidable, given that digital media have become the predominant form of communication in consumers' daily lives. Controlling what is at stake, and the quality of the data, must be a priority so as not to distort the strategies derived from processing it for profit. To achieve this, a great deal of research has been carried out by companies, and several platforms have been created. MapReduce, one of the enabling technologies, has proven applicable to a wide range of fields. However, despite its importance, recent work has shown its limitations, and Distributed Hash Tables (DHTs) have been used to remedy them. Thus, this document not only analyses DHT and MapReduce implementations and Top-Level Domains (TLDs) in general, but also provides a description of a DHT model as well as some guidelines for planning future research.
This document outlines the syllabus for a course on Big Data Analytics. It includes 5 modules that cover topics such as introduction to Hadoop, MapReduce and machine learning algorithms for big data. The syllabus provides learning objectives, textbook references, module outlines and topics to be covered in each module. It also specifies course outcomes, question paper pattern and module overviews that give concise descriptions of the key concepts covered in each topic.
Big data processing using - Hadoop Technology (Shital Kat)
This document summarizes a report on Hadoop technology as a solution to big data processing. It discusses the big data problem, including defining big data, its characteristics and challenges. It then introduces Hadoop as a solution, describing its components HDFS for storage and MapReduce for parallel processing. Examples of common friend lists and word counting are provided. Finally, it briefly mentions some Hadoop projects and companies that use Hadoop.
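The word-count example mentioned above is the canonical MapReduce illustration; the sketch below shows the same map and reduce logic with Spark's RDD API rather than raw Hadoop MapReduce, to keep the sketches on this page in one language. The input path is a placeholder.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("WordCountSketch").getOrCreate()

val counts = spark.sparkContext
  .textFile("hdfs:///data/input.txt")  // assumed input location
  .flatMap(_.split("\\s+"))            // "map" phase: emit individual words
  .filter(_.nonEmpty)
  .map(word => (word, 1L))             // key each word with a count of one
  .reduceByKey(_ + _)                  // "reduce" phase: sum counts per word

counts.take(20).foreach(println)
```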
This document discusses information processing architectures and models. It covers topics like online transaction processing, online analytical processing, complex event processing, and massively parallel processing. It also discusses shared nothing, shared disk and shared everything infrastructure models. The document then covers database architectures, tradeoffs in task scheduling, and the map reduce approach used for big data processing. Other topics discussed include ACID, BASE and CAP theories; distributed information management; data warehousing models; business intelligence models; multi-tier enterprise applications; mobile data progress; social, mobile and cloud computing; and cloud infrastructures for information processing.
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work (IRJET Journal)
This document discusses frameworks for processing big data that is distributed across geographic locations. It begins by introducing the challenges of geo-distributed big data processing and then describes several MapReduce-based frameworks like G-Hadoop and G-MR that can process pre-located geo-distributed data. It also covers Spark-based systems like Iridium and frameworks that partition data across geographic locations, such as KOALA grid-based systems. The document analyzes key aspects of geo-distributed big data processing systems like data distribution, task scheduling, and fault tolerance.
This document discusses big data solutions and introduces Hadoop. It defines common big data problems related to volume, velocity, and variety of data. Traditional storage does not work well for this type of unstructured data. Hadoop provides solutions through HDFS for storage, MapReduce for processing, and additional tools like HBase, Pig, Hive, Zookeeper, and Spark to handle different data and analytic needs. Each tool is described briefly in terms of its purpose and how it works with Hadoop.
Very basic Introduction to Big Data. Touches on what it is, characteristics, some examples of Big Data frameworks. Hadoop 2.0 example - Yarn, HDFS and Map-Reduce with Zookeeper.
Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e... (Dipayan Dev)
This document discusses Dr. Hadoop, a new framework proposed by authors Dipayan Dev and Ripon Patgiri to provide efficient and scalable metadata management for Hadoop. It addresses key issues with Hadoop's current single point of failure for metadata on the NameNode. The new framework is called Dr. Hadoop and uses a technique called Dynamic Circular Metadata Splitting (DCMS) that distributes metadata uniformly across multiple NameNodes for load balancing while also preserving metadata locality through consistent hashing and locality-preserving hashing. Dr. Hadoop aims to provide infinite scalability for metadata as data scales to exabytes without affecting throughput.
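The consistent-hashing idea at the heart of schemes like DCMS can be sketched in a few lines: each key is owned by the first node clockwise from its hash on a ring, so adding or removing a NameNode only relocates the keys in one arc. This is a generic illustration under that assumption, not Dr. Hadoop's actual code.

```scala
import scala.collection.immutable.TreeMap
import scala.util.hashing.MurmurHash3

// A toy consistent-hash ring: a key belongs to the first node whose hash is
// >= the key's hash, wrapping around to the smallest node hash if none is.
class HashRing(nodes: Seq[String]) {
  private def h(s: String): Int = MurmurHash3.stringHash(s)
  private val ring: TreeMap[Int, String] = TreeMap(nodes.map(n => h(n) -> n): _*)

  def nodeFor(key: String): String = {
    val clockwise = ring.iteratorFrom(h(key))          // nodes at or after the key
    val (_, node) = if (clockwise.hasNext) clockwise.next() else ring.head
    node
  }
}

val ring = new HashRing(Seq("namenode-1", "namenode-2", "namenode-3"))
println(ring.nodeFor("/user/data/file-42")) // owning NameNode for this key
```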
Performance Improvement of Heterogeneous Hadoop Cluster using Ranking Algorithm (IRJET Journal)
This document proposes using a ranking algorithm and sampling algorithm to improve the performance of a heterogeneous Hadoop cluster. The ranking algorithm prioritizes data distribution based on node frequency, so that higher frequency nodes are processed first. The sampling algorithm randomly selects nodes for data distribution instead of evenly distributing across all nodes. The proposed approach reduces computation time and improves overall cluster performance compared to the existing approach of evenly distributing data across nodes of varying sizes. Results show the proposed approach reduces execution time for various file sizes compared to the existing approach.
Secrets of Enterprise Data Mining: SQL Saturday Oregon 2014 (Mark Tabladillo)
If you have a SQL Server license (Standard or higher) then you already have the ability to start data mining. In this new presentation, you will see how to scale up data mining from the free Excel 2013 add-in to production use. Aimed at beginning to intermediate data miners, this presentation will show how mining models move from development to production. We will use SQL Server 2014 tools including SSMS, SSIS, and SSDT.
There is a growing trend of applications that must handle huge amounts of data. However, analysing such data is a very challenging problem today. Several techniques can be considered for this: Grid Computing, Volunteer Computing, and RDBMSs are potential candidates, and Hadoop, a tool still in its growth phase, can handle such data as well. We survey all of these techniques to find a suitable approach to manage and work with Big Data.
Similar to Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva (20)
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste... (Sérgio Sacani)
Context. With a mass exceeding several 10⁴ M⊙ and a rich and dense population of massive stars, supermassive young star clusters represent the most massive star-forming environment that is dominated by the feedback from massive stars and gravitational interactions among stars.
Aims. In this paper we present the Extended Westerlund 1 and 2 Open Clusters Survey (EWOCS) project, which aims to investigate the influence of the starburst environment on the formation of stars and planets, and on the evolution of both low and high mass stars. The primary targets of this project are Westerlund 1 and 2, the closest supermassive star clusters to the Sun.
Methods. The project is based primarily on recent observations conducted with the Chandra and JWST observatories. Specifically, the Chandra survey of Westerlund 1 consists of 36 new ACIS-I observations, nearly co-pointed, for a total exposure time of 1 Msec. Additionally, we included 8 archival Chandra/ACIS-S observations. This paper presents the resulting catalog of X-ray sources within and around Westerlund 1. Sources were detected by combining various existing methods, and photon extraction and source validation were carried out using the ACIS-Extract software.
Results. The EWOCS X-ray catalog comprises 5963 validated sources out of the 9420 initially provided to ACIS-Extract, reaching a photon flux threshold of approximately 2 × 10⁻⁸ photons cm⁻² s⁻¹. The X-ray sources exhibit a highly concentrated spatial distribution, with 1075 sources located within the central 1 arcmin. We have successfully detected X-ray emissions from 126 out of the 166 known massive stars of the cluster, and we have collected over 71 000 photons from the magnetar CXO J164710.20-455217.
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx (MAGOTI ERNEST)
Although Artemia has been known to man for centuries, its use as a food for the culture of larval organisms apparently began only in the 1930s, when several investigators found that it made an excellent food for newly hatched fish larvae (Litvinenko et al., 2023). As aquaculture developed in the 1960s and '70s, the use of Artemia also became more widespread, due both to its convenience and to its nutritional value for larval organisms (Arenas-Pardo et al., 2024). The fact that Artemia dormant cysts can be stored for long periods in cans, and then used as an off-the-shelf food requiring only 24 h of incubation, makes them the most convenient, least labor-intensive live food available for aquaculture (Sorgeloos & Roubach, 2021). The nutritional value of Artemia, especially for marine organisms, is not constant, but varies both geographically and temporally. During the last decade, however, both the causes of Artemia's nutritional variability and methods to improve poor-quality Artemia have been identified (Loufi et al., 2024).
Brine shrimp (Artemia spp.) are used in marine aquaculture worldwide. Annually, more than 2,000 metric tons of dry cysts are used for the cultivation of fish, crustacean, and shellfish larvae. Brine shrimp are important to aquaculture because newly hatched brine shrimp nauplii (larvae) provide a food source for many fish fry (Mozanzadeh et al., 2021). The culture and harvesting of brine shrimp eggs represents another aspect of the aquaculture industry. Nauplii and metanauplii of Artemia, commonly known as brine shrimp, play a crucial role in aquaculture due to their nutritional value and suitability as live feed for many aquatic species, particularly in larval stages (Sorgeloos & Roubach, 2021).
A thematic appreciation test is a psychological assessment tool used to measure an individual's appreciation and understanding of specific themes or topics. This test helps to evaluate an individual's ability to connect different ideas and concepts within a given theme, as well as their overall comprehension and interpretation skills. The results of the test can provide valuable insights into an individual's cognitive abilities, creativity, and critical thinking skills.
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx (RASHMI M G)
This document discusses abnormal (anomalous) secondary growth in plants. It defines secondary growth as an increase in plant girth due to vascular cambium or cork cambium. Anomalous secondary growth does not follow the normal pattern of a single vascular cambium producing xylem internally and phloem externally.
BREEDING METHODS FOR DISEASE RESISTANCE.pptxRASHMI M G
Plant breeding for disease resistance is a strategy to reduce crop losses caused by disease. Plants have an innate immune system that allows them to recognize pathogens and provide resistance. However, breeding for long-lasting resistance often involves combining multiple resistance genes
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...University of Maribor
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills MN
Travis Hills of Minnesota developed a method to convert waste into high-value dry fertilizer, significantly enriching soil quality. By providing farmers with a valuable resource derived from waste, Travis Hills helps enhance farm profitability while promoting environmental stewardship. Travis Hills' sustainable practices lead to cost savings and increased revenue for farmers by improving resource efficiency and reducing waste.
The debris of the ‘last major merger’ is dynamically youngSérgio Sacani
The Milky Way’s (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the
‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor
collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the
MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space,
because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia
DR3 have positive caustic velocities, making them fundamentally different than the phase-mixed chevrons found in simulations
at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based
on a simple phase-mixing model, the observed number of caustics are consistent with a merger that occurred 1–2 Gyr ago.
We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative
measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data
1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’
did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within
the last few Gyr, consistent with the body of work surrounding the VRM.
The binding of cosmological structures by massless topological defectsSérgio Sacani
Assuming spherical symmetry and weak field, it is shown that if one solves the Poisson equation or the Einstein field
equations sourced by a topological defect, i.e. a singularity of a very specific form, the result is a localized gravitational
field capable of driving flat rotation (i.e. Keplerian circular orbits at a constant speed for all radii) of test masses on a thin
spherical shell without any underlying mass. Moreover, a large-scale structure which exploits this solution by assembling
concentrically a number of such topological defects can establish a flat stellar or galactic rotation curve, and can also deflect
light in the same manner as an equipotential (isothermal) sphere. Thus, the need for dark matter or modified gravity theory is
mitigated, at least in part.
4. What is Big Data?
No single definition:
- Extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions
- Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them
5. Why is 'Big Data' so important?
Its relevance is increasing drastically and Big Data Analytics is an emerging field to explore
https://trends.google.com/trends/explore?date=all&q=%22big%20data%22
8. Big Data Europe (BDE) Platform
https://github.com/big-data-europe
(Platform stack:
- Support Layer: Init Daemon, GUIs, Monitor
- App Layer: Traffic Forecast, Satellite Image Analysis, ...
- Platform Layer: Spark, Flink; Semantic Layer: Ontario, SANSA, Semagrow; Kafka, Real-time Stream Monitoring, ...
- Data Layer: Hadoop, NoSQL Store (Cassandra, Elasticsearch, ...), RDF Store
- Resource Management Layer (Swarm)
- Hardware Layer: Premises, Cloud (AWS, GCP, MS Azure, ...))
9. Apache Spark
Fast and general-purpose cluster computing engine
(Stack: Spark Core Engine (RDD); Core APIs & Libraries: Spark SQL & DataFrames, Spark Streaming, MLlib (machine learning), GraphX (graph processing); Deploy: Local (single JVM), Cluster (Standalone, Mesos, YARN), Containers (docker-compose))
Allows for massive parallel processing of collections of records:
- RDD - Resilient Distributed Dataset
- DataFrame - Conceptually a table
- Dataset - Unified access to data as objects and/or tables
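To make the three abstractions concrete, here is a minimal Spark sketch in Scala (not taken from the slides); the input path and the naive whitespace-based N-Triples parsing are simplifying assumptions for illustration only.

import org.apache.spark.sql.SparkSession

object SparkAbstractions {
  // Simple case class so the same data can also be viewed as a typed Dataset.
  case class Triple(s: String, p: String, o: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-abstractions-sketch")
      .master("local[*]")                      // single JVM; use a cluster master URL otherwise
      .getOrCreate()
    import spark.implicits._

    // RDD: low-level collection of records processed in parallel.
    val lines = spark.sparkContext.textFile("data/sample.nt")   // hypothetical input path
    val rdd = lines.filter(_.trim.nonEmpty)
      .map(_.split(" ", 3))                                     // naive parsing, illustration only
      .filter(_.length == 3)
      .map(a => Triple(a(0), a(1), a(2)))

    // DataFrame: conceptually a table with named columns.
    val df = rdd.toDF()
    df.groupBy("p").count().show()

    // Dataset: typed view over the same rows.
    val ds = df.as[Triple]
    println(ds.filter(_.p.contains("type")).count())

    spark.stop()
  }
}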
10. Key Observation From BDE: Heterogeneity aka Variety
(Figure: a customer at the centre of heterogeneous data sources covering "our known history" - banking, finance, purchases, entertainment, gaming, and social media - with example brands such as VISA, Chase, SAP, IBM, Nordstrom, Amazon, Lowe's, Netflix, Hulu, NFL Network, Zynga, Xbox 360, Facebook, Pinterest, and Twitter.)
11. Knowledge Graphs
Modelling entities and their relationships: the RDF (Resource Description Framework) model
(Example graph: DPDHL --full name--> "Deutsche Post DHL Group"; DPDHL --industry--> Logistics (label "Logistik"); DPDHL --headquarters--> Post Tower; Post Tower --located in--> Bonn)
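As an illustration (not part of the slides), the example graph above can be written down as RDF triples with Apache Jena; the http://example.org/ namespace and the property names are assumptions made for the sketch.

import org.apache.jena.rdf.model.ModelFactory
import org.apache.jena.vocabulary.RDFS

object DpdhlGraphSketch {
  def main(args: Array[String]): Unit = {
    val ex = "http://example.org/"            // hypothetical namespace
    val model = ModelFactory.createDefaultModel()

    val dpdhl     = model.createResource(ex + "DPDHL")
    val logistics = model.createResource(ex + "Logistics")
    val postTower = model.createResource(ex + "PostTower")
    val bonn      = model.createResource(ex + "Bonn")

    // One statement per edge in the example graph.
    dpdhl.addProperty(model.createProperty(ex, "fullName"), "Deutsche Post DHL Group")
    dpdhl.addProperty(model.createProperty(ex, "industry"), logistics)
    logistics.addProperty(RDFS.label, model.createLiteral("Logistik", "de"))
    dpdhl.addProperty(model.createProperty(ex, "headquarters"), postTower)
    postTower.addProperty(model.createProperty(ex, "locatedIn"), bonn)

    model.write(System.out, "TURTLE")          // print the graph as Turtle
  }
}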
12. Knowledge Graphs
Modelling entities and their relationships
Analysis: finding the underlying structure of the graph, e.g. to predict unknown relationships
Examples: Google Knowledge Graph, DBpedia, Facebook, YAGO, Twitter, LinkedIn, MS Academic Graph, IBM Graph, Wikidata
13. Knowledge Graphs are everywhere
- Entity Search and Summarization
- Discovering Related Entities
14. Why Distributed RDF Data Processing?
Tasks that are hard to solve on single machines (>1 TB memory consumption):
- Querying and processing LinkedGeoData
- Dataset statistics and quality assessment of the LOD Cloud
- Vandalism and outlier detection in DBpedia
- Inference on life science data (e.g. UniProt, EggNOG, StringDB)
- Clustering of DBpedia data
- Large-scale enrichment and link prediction for e.g. DBpedia → LinkedGeoData
15. Main Research Question
Is it possible to process large-scale RDF
datasets efficiently and effectively?
15
16. RQ1: How can we efficiently explore the structure of large-scale RDF
datasets?
RQ2: Can we scale RDF dataset quality assessment horizontally?
RQ3: Can distributed RDF datasets be queried efficiently and
effectively?
Research Questions
16
17. RC1: A Scalable Distributed Approach for Computation of RDF Dataset
Statistics
RC2: A Scalable Framework for Quality Assessment of RDF Datasets
RC3: A Scalable Framework for SPARQL Evaluation of Large RDF Data
Contributions
17
19. SANSA
SANSA [1] is a processing data flow engine that provides data distribution and fault tolerance for distributed computation over large-scale RDF datasets
SANSA includes several libraries:
- Read / Write RDF / OWL library
- Querying library
- Inference library
- ML library
(Architecture figure: Knowledge Distribution & Representation, Querying, Inference, and Machine Learning layers on top of Core APIs & Libraries; deployable locally, on a standalone cluster, or via a resource manager; developed within BigDataEurope.)
20. RQ1: How can we efficiently explore the structure of large-scale RDF
datasets?
RQ2: Can we scale RDF dataset quality assessment horizontally?
RQ3: Can distributed RDF datasets be queried efficiently and
effectively?
Research Questions
20
22. To obtain an overview of the Web of Data, it is important to gather
statistical information describing the characteristics of the internal
structure of datasets
This process is both data-intensive and computing-intensive, and it is a
challenge to develop fast and efficient algorithms that can handle
large-scale RDF datasets
There are no existing approaches for RDF that compute these statistical
criteria and scale to large datasets
Motivation
22
23. A statistical criterion C is a triple C = (F, D, P), where:
- F is a SPARQL filter condition
- D is a derived dataset from the main dataset (RDD of triples) after
applying F
- P is a post-processing operation on the data structure D
RDDs are in-memory collections of records that can be operated on in
parallel across large clusters
- We use RDDs to represent RDF triples
Approach
23
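As a hedged illustration of the (F, D, P) decomposition - not the DistLODStats implementation itself - the following Scala sketch computes one criterion, class usage count, over an RDD of triples; the Triple case class and the string-based term representation are assumptions carried over from the earlier Spark sketch.

import org.apache.spark.rdd.RDD

object ClassUsageCount {
  case class Triple(s: String, p: String, o: String)

  val RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

  // F: the filter condition  -> keep rdf:type triples
  // D: the derived dataset   -> the RDD remaining after applying F
  // P: the post-processing   -> count usages per class
  def classUsageCount(triples: RDD[Triple]): RDD[(String, Int)] = {
    val derived = triples.filter(_.p == RDF_TYPE)   // F applied to the main RDD gives D
    derived
      .map(t => (t.o, 1))                           // P: group by class ...
      .reduceByKey(_ + _)                           //    ... and count occurrences
  }
}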
28. RQ1: How can we efficiently explore the structure of large-scale RDF
datasets?
RQ2: Can we scale RDF dataset quality assessment horizontally?
RQ3: Can distributed RDF datasets be queried efficiently and
effectively?
Research Questions
28
29. Quality Assessment of RDF
Datasets at Scale
A Scalable Framework for Quality Assessment of
RDF Datasets [3]
29
30. Assessing data quality is of paramount importance to judge its fitness
for a particular use case
Existing solutions cannot evaluate data quality metrics on medium /
large-scale datasets
→ This is actually where they are most important
Motivation
30
31. Quality Assessment Pattern (QAP)
- A reusable template to implement and design scalable quality
metrics
Approach
31
Quality Metric (QM) := Action | (QM OP Action)
OP := * | − | / | +
Action := Count(Transformation)
Transformation := Rule(Filter) | (Transformation BOP Transformation)
Filter := getPredicates~?p | getSubjects~?s | getObjects~?o | getDistinct(Filter) | (Filter or Filter) | (Filter && Filter)
Rule := isURI(Filter) | isIRI(Filter) | isInternal(Filter) | isLiteral(Filter) | !isBroken(Filter) | hasPredicateP | hasLicenceAssociated(Filter) | hasLicenceIndications(Filter) | isExternal(Filter) | hasType(Filter) | isLabeled(Filter)
BOP := ∩ | ∪
32. Architecture Overview
(Figure: workflow - Definition: define quality dimensions, quality metrics, thresholds and other configurations; RDF data is ingested into the SANSA engine as distributed data structures, the QAP-based quality assessment is executed, and the results are analysed via SANSA-Notebooks / the Data Quality Vocabulary (DQV).)
33. Experimental Setup
- Cluster configuration
- 7 machines (1 master, 6 workers): Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
(32 Cores), 128 GB RAM, 12 TB SATA RAID-5, Spark-2.4.0, Hadoop 2.8.0, Scala
2.11.11 and Java 8
Local mode: single instance of the cluster
- Datasets (all in .nt format)
Evaluation
33
Dataset        | #nr. of triples | size (GB)
LinkedGeoData  | 1,292,933,812   | 191.17
DBpedia (en)   | 812,545,486     | 114.4
DBpedia (de)   | 336,714,883     | 48.6
DBpedia (fr)   | 340,849,556     | 49.77
BSBM 2GB       | 8,289,484       | 2
BSBM 20GB      | 81,980,472      | 20
BSBM 200GB     | 817,774,057     | 200
36. RQ1: How can we efficiently explore the structure of large-scale RDF
datasets?
RQ2: Can we scale RDF dataset quality assessment horizontally?
RQ3: Can distributed RDF datasets be queried efficiently and
effectively?
Research Questions
36
37. Scalable RDF Querying
Sparklify: A Scalable Software Component for
Efficient evaluation of SPARQL queries over
distributed RDF datasets* [4]
37
* A joint work with Claus Stadler, a PhD student at the University of Leipzig.
38. Existing solutions are restricted to simple RDF constructs only
Hence they do not exploit the full potential of the knowledge, i.e. RDF
terms
Can we re-use existing Ontology-Based Data Access (OBDA) tooling to
facilitate running SPARQL queries on RDF kept in Apache Spark?
Motivation
38
39. Sparklify: Architecture Overview
(Figure: SANSA engine - RDF layer (data ingestion, partitioning) and query layer (Sparklifying); RDF data is loaded into distributed data structures, SPARQL queries are matched against Sparqlify views, and results are returned.)

SPARQL query:
SELECT ?s ?w WHERE {
  ?s a dbp:Person .
  ?s ex:workPage ?w .
}

Sparqlify view definition:
Prefix dbp: <http://dbpedia.org/ontology/>
Prefix ex: <http://ex.org/>
Create View view_person As
  Construct {
    ?s a dbp:Person .
    ?s ex:workPage ?w .
  }
  With
    ?s = uri('http://mydomain.org/person', ?id)
    ?w = uri(?work_page)
  Constrain
    ?w prefix "http://my-organization.org/user/"
  From
    person;

Resulting SQL:
SELECT id, work_page FROM view_person;

(Pipeline: SPARQL query → SPARQL Algebra Expression Tree (AET) → Normalize AET → SQL)
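The following Scala sketch illustrates the underlying idea - partition triples by predicate into Spark SQL tables and rewrite a basic graph pattern into a join - rather than the actual Sparklify/Sparqlify implementation; the table names, column names, and sample data are assumptions.

import org.apache.spark.sql.{DataFrame, SparkSession}

object SparqlToSqlSketch {
  case class Triple(s: String, p: String, o: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sparql-to-sql-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Tiny hypothetical dataset matching the example query.
    val triples: DataFrame = Seq(
      Triple("ex:alice", "rdf:type", "dbp:Person"),
      Triple("ex:alice", "ex:workPage", "http://my-organization.org/user/alice")
    ).toDF()

    // Vertical partitioning: one (s, o) table per predicate.
    triples.filter($"p" === "rdf:type").select($"s", $"o").createOrReplaceTempView("type")
    triples.filter($"p" === "ex:workPage").select($"s", $"o").createOrReplaceTempView("workPage")

    // The BGP { ?s a dbp:Person . ?s ex:workPage ?w } becomes a join over the two tables.
    spark.sql(
      """SELECT t.s AS s, w.o AS w
        |FROM type t JOIN workPage w ON t.s = w.s
        |WHERE t.o = 'dbp:Person'""".stripMargin
    ).show()

    spark.stop()
  }
}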
40. Experimental Setup
- Cluster configuration
  - 7 nodes (1 master, 6 workers), each with Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (32 cores), 128 GB RAM, 12 TB SATA RAID-5, connected via a Gigabit network
  - Each experiment executed 3 times, averaged results
- Datasets (all in .nt format)
Evaluation
40
Dataset      | #nr. of triples | size (GB)
LUBM-1K      | 138,280,374     | 24
LUBM-5K      | 690,895,862     | 116
LUBM-10K     | 1,381,692,508   | 232
WatDiv-10M   | 10,916,457      | 1.5
WatDiv-100M  | 108,997,714     | 15
WatDiv-1B    | 1,099,208,068   | 150
41. Evaluation
Runtime (s), mean:
Dataset     | Query | SPARQLGX-SDE a) total | Sparklify b) partitioning | Sparklify c) querying | Sparklify d) total
WatDiv-10M  | QC    | 103.24                | 134.81                    | 61                    | 195.84
WatDiv-10M  | QF    | 157.8                 | 236.06                    | 107.33                | 349.51
WatDiv-10M  | QL    | 102.51                | 241.24                    | 134                   | 370.3
WatDiv-10M  | QS    | 131.16                | 237.12                    | 108.56                | 346
WatDiv-1B   | QC    | partial fail          | 778.62                    | 2043.66               | 2829.56
WatDiv-1B   | QF    | 6734.68               | 1295.3                    | 2576.52               | 3871.97
WatDiv-1B   | QL    | 2575.72               | 1275.22                   | 610.66                | 1886.73
WatDiv-1B   | QS    | 4841.85               | 1290.72                   | 1552.05               | 2845.3
44. Evaluation
Sparklify vs SPARQLGX-SDE per query type performance on WatDiv 100M
Query Types: (QS: Star pattern, QL: Linear pattern, QF: Snowflake, QC: Complex pattern)
46. Are existing solutions more effective, i.e. does using property tables
reduce the number of necessary joins and unions?
What happens when not all subjects in a cluster use all properties?
- Wide property tables may be very sparse, containing many NULL
values, and thus impose a large storage overhead
How about using a flatter approach, i.e. partitioning into
subject-based groups (e.g. all triples associated with a
unique subject)?
Motivation
46
47. Semantic-Based: Architecture Overview
(Figure: SANSA engine - RDF layer (data ingestion, partitioning) and query layer (semantic, map operations) over distributed data structures, producing results from RDF data.)

SPARQL query:
SELECT ?p WHERE {
  ?p :owns ?c .
  ?c :madeIn :Ingolstadt .
}

Input triples:
Joy :owns Car1
Joy :livesIn Bonn
Car1 :typeOf Car
Car1 :madeBy Audi
Car1 :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memberOf Volkswagen
Ingolstadt :cityOf Germany

Subject-based (semantic) partitions:
Joy :owns Car1 :livesIn Bonn
Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memberOf Volkswagen
Ingolstadt :cityOf Germany
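A minimal Scala sketch (not the SANSA semantic-based implementation) of the subject-based grouping shown above: all predicate-object pairs of a subject are collapsed into one record, mirroring the slide's example; the Triple case class is the same simplifying assumption used in the earlier sketches.

import org.apache.spark.rdd.RDD

object SemanticPartitionSketch {
  case class Triple(s: String, p: String, o: String)

  // Group triples by subject and flatten each group into a single
  // "subject p1 o1 p2 o2 ..." record.
  def partitionBySubject(triples: RDD[Triple]): RDD[String] =
    triples
      .map(t => (t.s, s"${t.p} ${t.o}"))
      .reduceByKey((a, b) => s"$a $b")
      .map { case (subject, pos) => s"$subject $pos" }
}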
48. Experimental Setup
- Cluster configuration
- 6 machines (1 master, 5 workers): Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
(32 Cores), 128 GB RAM, 12 TB SATA RAID-5, Spark-2.4.0, Hadoop 2.8.0, Scala
2.11.11 and Java 8
- Datasets (all in nt format)
- Distributed SPARQL query evaluators we compare with:
- SHARD, SPARQLGX-SDE, and Sparklify
Evaluation
48
Dataset      | #nr. of triples | size (GB)
LUBM-1K      | 138,280,374     | 24
LUBM-2K      | 276,349,040     | 49
LUBM-3K      | 414,493,296     | 70
WatDiv-10M   | 10,916,457      | 1.5
WatDiv-100M  | 108,997,714     | 15
53. Powered By
Use case: Blockchain – Alethio (https://aleth.io/)
Alethio is using SANSA in order to perform large-scale batch analytics, e.g. computing the asset turnover for sets of accounts, computing attack pattern frequencies and Opcode usage statistics. SANSA was run on a 100-node cluster with 400 cores.

Use case: Big Data Platform – BDE (https://www.big-data-europe.eu/)
SANSA is used for computing statistics over logs within the BDE platform. BDE uses the Mu Swarm Logger service for detecting Docker events and converting their representation to RDF. In order to generate visualisations of log statistics, BDE then calls DistLODStats from SANSA-Notebooks.

Use case: Categorizing Areas of Interest (AOI) – SLIPO (http://slipo.eu/)
SLIPO focuses on designing efficient pipelines dealing with large semantic datasets of POIs. In this project, Sparklify is used through the SANSA query layer to refine, filter and select the relevant POIs which are needed by the pipelines.

10+ more use cases: http://sansa-stack.net/powered-by/
54. The Hubs and Authorities Transaction Network Analysis
(Pipeline: EthOn RDF triples stored in Amazon S3 buckets → SANSA engine (data ingestion, data partitioning, SPARQL querying) → graph analytics (PageRank, Connected Components) → Hubs & Authorities entities, Top Accounts, Wallet Exchange behaviour → data visualization using the Databricks notebooks or SANSA notebooks)
More than 18,000,000,000 facts*
* https://medium.com/alethio/ethereum-linked-data-b72e6283812f
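The following GraphX sketch illustrates the kind of analysis named in the pipeline (PageRank and connected components over a transaction graph); it is not the Alethio pipeline, and the vertex IDs and edge values are hypothetical.

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object TransactionGraphSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("tx-graph-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical transaction edges: (fromAccount, toAccount, transferred value).
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, 42.0),
      Edge(2L, 3L, 7.0),
      Edge(3L, 1L, 1.0),
      Edge(4L, 1L, 99.0)
    ))
    val graph = Graph.fromEdges(edges, defaultValue = 0.0)

    // PageRank highlights "authority"-like accounts that receive many transactions.
    val ranks = graph.pageRank(tol = 0.0001).vertices
    ranks.sortBy(_._2, ascending = false).take(3).foreach(println)

    // Connected components group accounts that transact with each other.
    val components = graph.connectedComponents().vertices
    println(components.map(_._2).distinct().count() + " connected components")

    spark.stop()
  }
}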
56. Scalable Integration of Big POI Data
Pipe different clustering algorithms at once
(Pipeline: RDF POI data → pre-processing (SPARQL filtering) → POI/category matrix → word embedding → semantic clustering → geo clustering)

POI_ID | Cat1 | Cat2
1      | 0    | 1
2      | 1    | 0
3      | 0    | 1
4      | 1    | 1
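As a hedged illustration of clustering POIs on their category vectors (not the SLIPO implementation), the following Spark MLlib sketch runs k-means over the small POI/category matrix from the slide; the choice of k = 2 is arbitrary.

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object PoiClusteringSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("poi-clustering-sketch").master("local[*]").getOrCreate()

    // The POI/category matrix from the slide, as (id, feature vector) rows.
    val pois = spark.createDataFrame(Seq(
      (1, Vectors.dense(0.0, 1.0)),
      (2, Vectors.dense(1.0, 0.0)),
      (3, Vectors.dense(0.0, 1.0)),
      (4, Vectors.dense(1.0, 1.0))
    )).toDF("poi_id", "features")

    // Cluster POIs by category profile; k = 2 is an arbitrary choice for the sketch.
    val model = new KMeans().setK(2).setSeed(1L).fit(pois)
    model.transform(pois).show()   // adds a 'prediction' column with the cluster id

    spark.stop()
  }
}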
58. RQ1: How can we efficiently explore the structure of large-scale RDF
datasets?
- First algorithm for computing RDF dataset statistics at scale using
Apache Spark
- An analysis of the complexity of the computational steps and the
data exchange between nodes in the cluster
- Integrated the approach into the SANSA framework
- A REST Interface for triggering RDF statistics calculation
Review of the Contributions
58
59. RQ2: Can we scale RDF dataset quality assessment horizontally?
- A Quality Assessment Pattern QAP to characterize scalable quality
metrics
- A distributed (open source) implementation of quality metrics using
Apache Spark
- Analysis of the complexity of the metric evaluation
- Evaluation of our approach, demonstrating empirically its superiority
over a previous centralized approach
- Integrated the approach into the SANSA framework
Review of the Contributions
59
60. RQ3: Can distributed RDF datasets be queried efficiently and
effectively?
- A novel approach for vertical partitioning including RDF terms and a
scalable query system (Sparklify) using SPARQL-to-SQL rewriter on
top of Apache Spark
- A scalable semantic-based partitioning and semantic-based query
engine (SANSA.Semantic) on top of Apache Spark
- Evaluation of the proposed approaches against state-of-the-art
engines, demonstrating their performance empirically
- Integrated the approaches into the SANSA framework
Review of the Contributions
60
61. Large-scale RDF Dataset Statistics
- Our approach is purely batch processing, in which the data chunks
are normally very large; therefore we plan to investigate additional
techniques for lowering the network overhead and I/O footprint, e.g.
HDT compression
- Near real-time computation of RDF dataset statistics using Spark
Streaming
Limitations and Future Directions
61
62. Assessment of RDF Datasets at Scale
- Intelligent partitioning strategies and dependency analysis
in order to evaluate multiple metrics simultaneously
- Real-time interactive quality assessment of large-scale RDF data
using Spark Streaming
- A declarative plugin using Quality Metric Language (QML), with the
ability to express, customize and enhance quality metrics
- Quality Assessment As a Service
- Quality check over LODStats
Limitations and Future Directions
62
63. Scalable RDF Querying
- Combine OBDA tools with dictionary encoding of RDF terms as
integers and evaluate the effects
- Extend our parser to support more SPARQL fragments and adding
statistics to the query engine while evaluating queries
- Investigate the re-ordering of the BGPs and evaluate the effects on
query execution time
- Consider other data management operations, i.e. additions, updates,
and deletions, e.g. the Delta Lake solution as an alternative storage layer
that brings ACID transactions to RDF data management solutions
Limitations and Future Directions
63
64. Adaptive Distributed RDF Querying
- Optimize index structures and distribute data based on anticipated
query workloads of particular inference or ML algorithms
Efficient Recommendation System for RDF Partitioners
- A recommender to suggest the “best partitioner” for our SPARQL
query evaluators based on the structure of the data (statistics)
A Powerful Benchmarking Suite
Limitations and Future Directions
64
65. With the increasing amount of RDF data, the processing of large-scale
RDF datasets constantly faces new challenges
We have shown the benefits of using distributed computing frameworks
for scalable and efficient processing of RDF datasets
Future research can build upon the contributions presented in
this thesis towards comprehensive scalable processing of RDF datasets
The main contributions of this thesis have been integrated into the
SANSA framework, making an impact on the Semantic Web community
Closing Remarks
65
67. [1]. Distributed Semantic Analytics using the SANSA Stack. Jens Lehmann; Gezim Sejdiu; Lorenz Bühmann; Patrick
Westphal; Claus Stadler; Ivan Ermilov; Simon Bin; Nilesh Chakraborty; Muhammad Saleem; Axel-Cyrille Ngonga Ngomo;
and Hajira Jabeen. In Proceedings of 16th International Semantic Web Conference - Resources Track (ISWC'2017), 2017.
[2]. DistLODStats: Distributed Computation of RDF Dataset Statistics. Gezim Sejdiu; Ivan Ermilov; Jens Lehmann; and
Mohamed Nadjib-Mami. In Proceedings of 17th International Semantic Web Conference, 2018.
[3]. A Scalable Framework for Quality Assessment of RDF Datasets. Gezim Sejdiu; Anisa Rula; Jens Lehmann; and Hajira
Jabeen. In Proceedings of 18th International Semantic Web Conference, 2019.
[4]. Sparklify: A Scalable Software Component for Efficient evaluation of SPARQL queries over distributed RDF datasets.
Claus Stadler; Gezim Sejdiu; Damien Graux; and Jens Lehmann. In Proceedings of 18th International Semantic Web
Conference, 2019.
[5]. Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation. Gezim Sejdiu; Damien
Graux; Imran Khan; Ioanna Lytra; Hajira Jabeen; and Jens Lehmann. In 15th International Conference on Semantic
Systems (SEMANTiCS), 2019.
References
67
69. SPARQL is a standard query language for retrieving and manipulating
RDF data
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?hq ?location
WHERE {
dbr:Deutsche_Post foaf:name ?name.
dbr:Deutsche_Post dbo:location ?hq.
?hq foaf:name ?location.
}
Querying Knowledge Graphs
69
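For readers who want to try the query above, here is a minimal Jena ARQ sketch (not the SANSA query layer) that executes it against the public DBpedia endpoint; the endpoint URL and its availability are assumptions.

import org.apache.jena.query.{QueryExecutionFactory, QueryFactory}

object SparqlQuerySketch {
  def main(args: Array[String]): Unit = {
    val queryString =
      """PREFIX dbr: <http://dbpedia.org/resource/>
        |PREFIX dbo: <http://dbpedia.org/ontology/>
        |PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        |SELECT ?name ?hq ?location
        |WHERE {
        |  dbr:Deutsche_Post foaf:name ?name .
        |  dbr:Deutsche_Post dbo:location ?hq .
        |  ?hq foaf:name ?location .
        |}""".stripMargin

    val query = QueryFactory.create(queryString)
    // Run the query against the public DBpedia endpoint (assumed reachable).
    val qexec = QueryExecutionFactory.sparqlService("https://dbpedia.org/sparql", query)
    try {
      val results = qexec.execSelect()
      while (results.hasNext) {
        val row = results.next()
        println(s"${row.get("name")} | ${row.get("hq")} | ${row.get("location")}")
      }
    } finally qexec.close()
  }
}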
70. Over the last few years, the size of the Semantic Web has increased and
several large-scale datasets have been published
As of March 2019: ~10,000 datasets openly available online using
Semantic Web standards, plus many more datasets RDFized and kept private
Motivation
70
Source: LOD-Cloud (http://lod-cloud.net/ )
72. Overall Breakdown of DistLODStats by Criterion Analysis (log scale)
Evaluation
72
73. STATisfy: A REST Interface for DistLODStats
(Figure: collaborative analytics services / marketplace call a REST server, which triggers SANSA DistLODStats on a cluster (master, worker 1 .. worker n) deployed locally, standalone, or via a resource manager; developed within BigDataEurope.)
74. QAP: consists of transformations and actions
- Transformation: a rule set or a union/intersection of transformations
- Rule: defines a conditional criterion for a triple, e.g. isIRI()
- Filter: retrieves a subset of an RDF triple, e.g. getPredicates
- Shortcuts ?s, ?p, ?o are frequently used for filters
- Action: maps a triple set to a numerical value, e.g. count(r)
Quality Assessment Patterns (QAPs)
74
Metric: External Linkage
Transformation τ:
  r1 = isIRI(?s) ∩ internal(?s) ∩ isIRI(?o) ∩ external(?o)
  r2 = isIRI(?s) ∩ external(?s) ∩ isIRI(?o) ∩ internal(?o)
  r3 = r1 ∪ r2
Action α:
  α1 = count(r3)
  α2 = count(triples)
  α = α1 / α2
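A minimal Spark sketch of the External Linkage metric as specified above (not the DistQualityAssessment code); deciding "internal" vs "external" against a fixed base URI and the string-based Triple case class are assumptions of the sketch.

import org.apache.spark.rdd.RDD

object ExternalLinkageSketch {
  case class Triple(s: String, p: String, o: String)

  val baseUri = "http://example.org/"               // hypothetical notion of "internal"

  def isIRI(term: String): Boolean      = term.startsWith("<")
  def isInternal(term: String): Boolean = term.startsWith("<" + baseUri)
  def isExternal(term: String): Boolean = isIRI(term) && !isInternal(term)

  def externalLinkage(triples: RDD[Triple]): Double = {
    // r1: internal subject linked to an external object
    val r1 = triples.filter(t => isIRI(t.s) && isInternal(t.s) && isExternal(t.o))
    // r2: external subject linked to an internal object
    val r2 = triples.filter(t => isExternal(t.s) && isIRI(t.o) && isInternal(t.o))
    val r3 = r1.union(r2)                           // r3 = r1 ∪ r2
    val a1 = r3.count().toDouble                    // α1 = count(r3)
    val a2 = triples.count().toDouble               // α2 = count(triples)
    if (a2 == 0) 0.0 else a1 / a2                   // α = α1 / α2
  }
}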
75. Overall analysis of DistQualityAssessment by metric in the cluster mode
(log scale)
Evaluation
75
76. Overall analysis of queries on LUBM-1K dataset (cluster mode) using
Semantic-based approach
Evaluation
76