Knowledge extraction and incorporation are currently considered beneficial for efficient Big Data analytics. Knowledge can take part in workflow design, constraint definition, parameter selection and configuration, and human-interactive and decision-making strategies. Here we present BIGOWL, an ontology to support knowledge management in Big Data analytics. BIGOWL is designed to cover a wide vocabulary of terms concerning Big Data analytics workflows, including their components and how they are connected, from data sources to the visualization of analytics results. It also takes into consideration aspects such as parameters, restrictions and formats. This ontology defines not only the taxonomic relationships between the different concepts, but also instances representing specific individuals to guide users in the design of Big Data analytics workflows. For testing purposes, two case studies are developed: first, real-world stream processing with Spark of traffic Open Data, for route optimization in the urban environment of New York City; and second, data mining classification of an academic dataset on local/cloud platforms. The analytics workflows resulting from the BIGOWL semantic model are validated and successfully evaluated.
1. IATECH MÁLAGA
BIGOWL: Using Semantics to
Develop Big Data Analytics
Solutions
José Manuel García Nieto
jnieto@lcc.uma.es
2. IATECH MÁLAGA
Outline
• Introduction
• Concepts and Background
• Current practices in Big Data analytics
• Semantic modelling
• Overall approach
• Validation: Case studies
• Discussions
• Conclusions
BIGOWL: Using Semantics to Develop Big Data Analytics Solutions
3. IATECH MÁLAGA
Introduction
• Motivation
• Gartner’s report: An emerging challenge in Big
Data is to construct data-driven intelligent
applications that capture and inject domain
knowledge in the analytical processes, including
context and using a standardized format
• Context refers to all the relevant (meta-)information
that supports the analysis and helps to interpret its
results
• This will facilitate the integration (in a standardized
way) with third parties’ data, algorithms, business
intelligence (BI) and visualization services
4. IATECH MÁLAGA
Introduction
• Motivation
• The use of semantics as contextual information will
enhance the analytical power of the algorithms, as well as the
reuse of single components in data analytics workflows
(Ristoski & Paulheim, 2016)
• The development of ways to make the domain knowledge
explicit and usable is needed to improve the data processing
and analysis tasks
• The Semantic Web technology can be used to annotate not
only the knowledge domain of the data, but also the analytics’
meta-data (Keet, Ławrynowicz et al., 2015)
9. IATECH MÁLAGA
Introduction
• Motivation
• Companies have
already realised
• Administration
(European
Commission) too
https://www.big-data-europe.eu/ http://www.bigdataocean.eu/
http://www.semagrow.eu/
11. IATECH MÁLAGA
Introduction
• Motivation
• In Academics we aim at going one step further
• The Semantic Web technologies can be used to annotate not only the knowledge
domain of the data, but also the analytics’ meta-data, including: algorithms’ parameters,
input variables, tuning experiences, expected behaviours and taxonomies
• This will facilitate the reuse and composition of Big Data
analytics in a proper manner
• It will also enhance the quality of consumed and produced
data
12. IATECH MÁLAGA
Introduction
• Hypothesis:
The semantic annotation of Big Data sources, components and algorithms can
act as a link to capture and incorporate the domain knowledge to guide and
enhance the analytical processes
• In addition, the semantic annotation can provide the background for
reasoning methods based on axiomatic and rule logic recommendations
13. IATECH MÁLAGA
Introduction
• Proposal:
• Semantic model: ontology-driven approach to support knowledge
management in Big Data analytics workflows
• The proposed ontology is called BIGOWL (BIG data analytics OWL 2 ontology),
which acts as a formal schema for the representation and consolidation of
knowledge in Big Data analytics
• Knowledge incorporation is in turn beneficial for efficient algorithmic performance, by
taking part in operator design, parameter selection, and human-interactive and
decision-making strategies
15. IATECH MÁLAGA
Concepts and Background
• Different sites and people will talk about everything from artificial
intelligence to natural language processing to linked data and the Semantic
Web
• What are they all?
• How do they relate to each other?
• How do they relate to you?
The Semantic Web, Web 3.0, the Linked Data Web, the Web of Data…whatever
you call it, the Semantic Web represents the next major evolution in connecting
information
16. IATECH MÁLAGA
Concepts and Background
• How is the “Semantic Web” Different?
• The word semantic itself implies meaning or understanding
• Semantic Web is concerned with the
meaning and not the structure of data
(such as, relational databases or
the World Wide Web itself)
17. IATECH MÁLAGA
Concepts and Background
• What Standards Apply to the Semantic Web?
• Mainly 4 technical standards:
• An Ontology provides a formal representation of the real world
• It defines an explicit description of concepts in a domain of discourse
(classes or concepts), properties of each concept describing various
features and attributes of the concept (properties) and restrictions on
properties
• RDF (Resource Description Framework): The data modelling language for the
Semantic Web. All Semantic Web information is stored and represented in
RDF
• SPARQL (SPARQL Protocol and RDF Query Language): The query language of the
Semantic Web. It is specifically designed to query data across various systems
• OWL (Web Ontology Language): The schema language, or knowledge
representation (KR) language, of the Semantic Web
• OWL enables you to define concepts composably so that these concepts
can be reused as much and as often as possible
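As a rough illustration of how RDF-style statements and a SPARQL-style pattern query fit together, the sketch below stores triples as plain Python tuples and matches them against a pattern with wildcards. It is a simplified stand-in, not real RDF or SPARQL tooling, and every name with the `ex:` prefix is a hypothetical identifier invented for this example.

```python
# Each statement is a (subject, predicate, object) triple, as in RDF.
# All "ex:" names are hypothetical and used only for illustration.
TRIPLES = {
    ("ex:Workflow1", "rdf:type", "ex:Workflow"),
    ("ex:Workflow1", "ex:hasComponent", "ex:SparkReader"),
    ("ex:Workflow1", "ex:hasComponent", "ex:NSGAII"),
    ("ex:SparkReader", "rdf:type", "ex:DataSource"),
    ("ex:NSGAII", "rdf:type", "ex:Algorithm"),
}


def match(pattern, triples=TRIPLES):
    """Return the triples that fit a pattern, in sorted order.

    A None in the pattern acts as a wildcard, playing the role that a
    variable plays in a SPARQL triple pattern.
    """
    return sorted(
        triple
        for triple in triples
        if all(p is None or p == value for p, value in zip(pattern, triple))
    )


# "Which components does Workflow1 have?"
components = [obj for _, _, obj in match(("ex:Workflow1", "ex:hasComponent", None))]
print(components)
```

A real deployment would rely on an RDF store and a SPARQL endpoint; the tuple set here only shows the shape of the data and of the question being asked.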
18. IATECH MÁLAGA
Concepts and Background
• What Standards Apply to the Semantic
Web?
• Imagine two relational tables of two
different databases: movies and cinema
rooms
• Imagine a program can automatically query
your Web site and any other site that has
movie scheduling information in order to
show a complete view in one place
The goal of Linked Data is to publish structured
data in such a way that it can be easily
consumed and combined with other Linked
Data
19. IATECH MÁLAGA
Concepts and Background
• How is the “Semantic Web” Different?
• Linked Data is the Semantic Web realized via four best practice principles
• Use URIs as names for things.
• An example of a URI is any URL
• Use HTTP URIs so that people can look up those names
• When someone looks up a URI, provide useful information, using standards such as
RDF* and SPARQL
• Include links to other URIs so that they can discover more things
20. IATECH MÁLAGA
Concepts and Background
• How is the “Semantic Web”
Different?
• Once all the rows of our tables
have been uniquely identified,
made dereferenceable through
HTTP, and described with RDF, the
last step is providing links
between different rows across
different tables
• The main aim here is to make
explicit those links that were
implicit before shifting to the
Linked Data approach. In our
example, movies would be linked
to the theatres in which they are
playing
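The movie and theatre example can be sketched as follows: each row receives an HTTP URI, and the foreign key that was implicit in the table becomes an explicit link between two URIs. The base namespace, the column names and the property names (`title`, `name`, `playingAt`) are assumptions made for this sketch only.

```python
BASE = "http://example.org"  # hypothetical namespace, not from the slides

# Two small "relational tables" with made-up rows.
movies = [{"id": 1, "title": "Movie A", "theatre_id": 10}]
theatres = [{"id": 10, "name": "Theatre X"}]


def uri(kind, row_id):
    """Mint an HTTP URI that uniquely names one row."""
    return f"{BASE}/{kind}/{row_id}"


def rows_to_triples(movie_rows, theatre_rows):
    """Describe every row as triples and link movies to their theatres."""
    triples = []
    for theatre in theatre_rows:
        triples.append((uri("theatre", theatre["id"]), f"{BASE}/vocab/name", theatre["name"]))
    for movie in movie_rows:
        subject = uri("movie", movie["id"])
        triples.append((subject, f"{BASE}/vocab/title", movie["title"]))
        # The implicit foreign key becomes an explicit, followable link.
        triples.append((subject, f"{BASE}/vocab/playingAt", uri("theatre", movie["theatre_id"])))
    return triples


for triple in rows_to_triples(movies, theatres):
    print(triple)
```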
21. IATECH MÁLAGA
Concepts and Background
• How is the “Semantic Web” Different?
• Once our tables have been so published, the Linked Data rules do
their magic: people across the Web can start referencing and
consuming the data in our rows easily
• If we go further and link from our movies to popular external data
sets such as Wikipedia and IMDb, then we make it even easier for
people and computers to consume our data and combine it with
other data
• This provides our data with context
22. IATECH MÁLAGA
Current practices in Big Data analytics
• In current Big Data
technology ecosystems,
when facing a specific
data analytics task, it is
usual to rely on
existing tools
23. IATECH MÁLAGA
Current practices in Big Data analytics
• Besides technological or commercial aspects, current Big Data platforms still follow
the common procedure when facing data analytics tasks (ACM-SIGKDD, 2014), which
comprises typical steps of classical KDD:
• data collection,
• data transformation,
• data mining,
• pattern evaluation, and
• knowledge presentation (Visualization)
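The five steps above can be read as a chain of functions in which each stage consumes the output of the previous one. The sketch below is a toy version of that chain; the records, the frequency-count "mining" step and the `min_support` threshold are all assumptions chosen to keep the example small.

```python
from collections import Counter


def collect():
    """Data collection: hypothetical raw records standing in for a data source."""
    return [" spark ", "Weka", "spark", "", "BigML", "weka "]


def transform(records):
    """Data transformation: trim, lowercase and drop empty records."""
    return [record.strip().lower() for record in records if record.strip()]


def mine(records):
    """Data mining: here simply count how often each value occurs."""
    return Counter(records)


def evaluate(patterns, min_support=2):
    """Pattern evaluation: keep only patterns seen at least min_support times."""
    return {name: count for name, count in patterns.items() if count >= min_support}


def present(patterns):
    """Knowledge presentation: render the surviving patterns as text lines."""
    return [f"{name}: {count}" for name, count in sorted(patterns.items())]


report = present(evaluate(mine(transform(collect()))))
print(report)
```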
24. IATECH MÁLAGA
Semantic modelling
• The semantic model: BIGOWL
• Ontological scheme driving the whole process of Big Data analytics
• It is the terminological box (TBox) that defines the vocabulary of concepts and properties (relationships) in the
domain of Big Data analysis
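To make the idea of a vocabulary with taxonomic relationships concrete, the sketch below encodes a tiny class hierarchy and a few individuals, then answers "is this individual a kind of that class?" by walking up the hierarchy. The class names are hypothetical and are not taken from the actual BIGOWL ontology; only J48 and NSGA-II appear in the case studies.

```python
# Terminological side: which class is a subclass of which (hypothetical names).
SUBCLASS_OF = {
    "ClassificationAlgorithm": "Algorithm",
    "OptimizationAlgorithm": "Algorithm",
    "Algorithm": "Component",
    "DataSource": "Component",
}

# Assertional side: individuals and the class each one is declared to belong to.
INSTANCE_OF = {
    "J48": "ClassificationAlgorithm",
    "NSGAII": "OptimizationAlgorithm",
}


def ancestors(cls):
    """Return every superclass of cls, nearest first."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def is_a(individual, cls):
    """True if the individual belongs to cls directly or through inheritance."""
    direct = INSTANCE_OF[individual]
    return cls == direct or cls in ancestors(direct)


print(is_a("J48", "Component"))   # inherited membership
print(is_a("J48", "DataSource"))  # unrelated branch of the taxonomy
```

An OWL reasoner performs this kind of inference, and much richer ones, over the real ontology; the dictionaries here only mimic the subclass and instance relations.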
28. IATECH MÁLAGA
Validation: Case studies
• Case study 1: Streaming processing of New York City traffic open-data
• Dynamic version of the bi-objective Traveling Salesman Problem (TSP), which minimizes the
“travel time” and the “distance” needed to cover certain routing points in an urban area
• Open Data API provided by the
New York City Department of
Transportation
• Updates traffic information
several times per minute
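A minimal sketch of the two objectives, assuming the routing points are indexed 0..n-1 and that distances and travel times are available as square matrices. The numbers are made up; in the dynamic setting described above, the travel-time matrix would be refreshed from the traffic feed and routes re-evaluated.

```python
def evaluate_route(route, distance, travel_time):
    """Return (total_distance, total_travel_time) for a closed tour.

    The tour visits the points in the given order and returns to the start.
    """
    legs = list(zip(route, route[1:] + route[:1]))
    total_distance = sum(distance[a][b] for a, b in legs)
    total_time = sum(travel_time[a][b] for a, b in legs)
    return total_distance, total_time


# Hypothetical 3-point instance (values invented for illustration).
distance = [
    [0, 2, 4],
    [2, 0, 3],
    [4, 3, 0],
]
travel_time = [
    [0, 5, 6],
    [5, 0, 9],
    [6, 9, 0],
]

objectives = evaluate_route([0, 1, 2], distance, travel_time)
print(objectives)
```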
29. IATECH MÁLAGA
Validation: Case studies
• Case study 1: Streaming processing of New York City traffic open-data
• Analyser: the multi-objective metaheuristic NSGA-II provided in jMetalSP, which allows
parallel processing of evaluation functions in an Apache Spark environment
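NSGA-II ranks candidate solutions by Pareto dominance. The sketch below shows only that building block for a minimization problem with two objectives; it is not the jMetalSP API, and the candidate values are invented.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated(points):
    """Keep the points that no other point dominates (the first Pareto front)."""
    return [
        p for p in points
        if not any(dominates(q, p) for q in points if q != p)
    ]


# Hypothetical (distance, travel_time) pairs for four candidate routes.
candidates = [(9, 20), (8, 25), (10, 22), (12, 18)]
front = non_dominated(candidates)
print(front)
```

Because each candidate is evaluated independently of the others, the evaluation step is the natural place to parallelize, which is the part the slide says runs on Spark.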
30. IATECH MÁLAGA
Validation: Case studies
• Case study 1: Streaming processing of New York City traffic open-data
• Workflow
31. IATECH MÁLAGA
Validation: Case studies
• Case study 1: Streaming processing of New York City traffic open-data
• Ontology definition of this workflow
33. IATECH MÁLAGA
Validation: Case studies
• Case study 1: Streaming
processing of New York City traffic
open-data
• Semantic annotation and querying
34. IATECH MÁLAGA
Validation: Case studies
• Case study 2: academic problem of Iris flower classification
• Classification algorithm: decision tree J48
• UCI Repository
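J48 is Weka's implementation of the C4.5 decision tree, which chooses splits using an entropy-based criterion (gain ratio, a normalized form of information gain). The sketch below computes plain information gain for one threshold split; the petal lengths are made-up illustrative values, not rows from the UCI dataset, and the code is not Weka's implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on value <= threshold."""
    left = [label for value, label in zip(values, labels) if value <= threshold]
    right = [label for value, label in zip(values, labels) if value > threshold]
    weighted = sum(
        len(part) / len(labels) * entropy(part)
        for part in (left, right)
        if part
    )
    return entropy(labels) - weighted


# Made-up petal lengths (cm) and species labels, for illustration only.
petal_length = [1.4, 1.5, 1.3, 4.5, 4.7, 5.9, 6.1, 5.6]
species = ["setosa"] * 3 + ["versicolor"] * 2 + ["virginica"] * 3

gain = information_gain(petal_length, species, threshold=2.0)
print(round(gain, 3))
```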
35. IATECH MÁLAGA
Validation: Case studies
• Case study 2: academic problem of Iris flower classification
• For materialization, two different approaches have been used in this case:
• the well-known library for data mining Weka and
• the BigML SaaS API for analysis on-cloud
36. IATECH MÁLAGA
Validation: Case studies
• Case study 2: academic problem of Iris flower classification
• Ontology definition of this workflow
37. IATECH MÁLAGA
Validation: Case studies
• Case study 2: academic problem of Iris flower classification
• Analytic workflow
38. IATECH MÁLAGA
Validation: Case studies
• Case study 2: academic problem
of Iris flower classification
• Semantic annotation and querying
39. IATECH MÁLAGA
Validation: Case studies
• Case study 3: Reasoning
• SWRL rules perform semantic reasoning jobs mainly devoted to checking the correctness of
workflows, i.e., to discovering those components and tasks with (non-)compatible
connectivity of inputs/outputs, execution orders, data domains, data formats, data types,
etc.
• SWRL rules are then evaluated by the reasoner after classifying Big Data components in
accordance with axioms
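The kind of check such rules express can be sketched procedurally: walk the workflow in execution order and flag every connection whose producer output does not match the consumer input. This is plain Python rather than SWRL, it covers only data formats, and the component names and formats are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Component:
    name: str                     # hypothetical component name
    input_format: Optional[str]   # None for a data source with no input
    output_format: Optional[str]  # None for a final sink with no output


def incompatible_links(workflow: List[Component]) -> List[Tuple[str, str]]:
    """Return (producer, consumer) pairs whose data formats do not match."""
    return [
        (producer.name, consumer.name)
        for producer, consumer in zip(workflow, workflow[1:])
        if producer.output_format != consumer.input_format
    ]


reader = Component("CsvReader", None, "csv")
converter = Component("CsvToArff", "csv", "arff")
classifier = Component("TreeClassifier", "arff", "model")

valid = incompatible_links([reader, converter, classifier])
broken = incompatible_links([reader, classifier])
print(valid)
print(broken)
```

Expressing the same constraint as a rule over the ontology lets a reasoner apply it to any workflow described with the vocabulary, instead of hard-coding it per tool.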
40. IATECH MÁLAGA
Validation: Case studies
• Case study 3: Reasoning
• SWRL rules to check correctness of workflows
41. IATECH MÁLAGA
Conclusions
• Experience in the case studies revealed that the BIGOWL approach is useful
when integrating domain knowledge concerning a specific analytic
problem
• Consequently, the integrated knowledge is used for guiding the
design of Big Data analytics workflows, by recommending next
components to be linked, and supporting final validation
42. IATECH MÁLAGA
Research agenda
• First phase: to provide automatic facilities for ontology population, and hence to enrich
the semantic approach
• To generate new and heterogeneous use cases of analytics workflows that would lead
us to find and solve possible new deficiencies, as well as to enrich the knowledge
base