Hybrid use of machine learning and ontology

The following presentation was initially delivered at Dataversity’s Enterprise Data World
(EDW) 2018 conference held in San Diego, CA, in April 2018.(EDW) 2018 conference held in San Diego, CA, in April 2018.
1

To get started this presentation provides some background on two of the primaryTo get started this presentation provides some background on two of the primary
approaches behind building intelligent systems, specifically INDUCTIVE approaches that use
statistical machine learning and CONSTRUCTIVE approaches that use formal semantic
models such as taxonomies or ontologies. It also covers the strengths and weaknesses of
both approaches. Then it moves on to the main goal of the presentation which is to make
the argument that there are substantial benefits to using a combination of both approaches
together, working as complements to each other -- basically a hybrid or blended approach -
- versus using either one of the approaches on their own. In the final segment the
presentation covers how the two approaches can be used together based on some specific
workflows and examples, and discusses some tools and other resources that can be used to
support a hybrid or blended approach.
2

Beginning around 2011 or 2012 there was a renaissance of interest in the field of AI. BigBeginning around 2011 or 2012 there was a renaissance of interest in the field of AI. Big
data was everywhere. A new breed of scientists, called data scientists, started using
sophisticated statistical algorithms to induce patterns from these large data sets to do
things like pick out keywords or themes from texts, or objects and people from images.
Some of the data sets were used as representative data sets to train algorithms to do even
more sophisticated pattern matching and even predictions. These statistical patterns
rapidly replaced explicit knowledge models. Amazingly, nerdy, geeky data scientists became
sexy! And before you knew it, machine learning effectively became what we know today as
AI. Basically you could say that “ML ate AI”.
3

Those of you who weren’t in this field prior to 2011 might well ask: how were intelligentThose of you who weren’t in this field prior to 2011 might well ask: how were intelligent
systems built before ML came along, or were there even such systems before ML? The
answer is, yes, there were some intelligent systems before ML came along. They were
typically built on top of explicit knowledge models – models that were pretty much
constructed by hand, either individually or as group or community efforts where the
models were sort of ‘crowd-sourced’. There were actually lots of these types of efforts
going on in different contexts. For example, library science classification systems and things
like thesauri were used to annotate documents and datasets with semantic metadata. Data
dictionaries, including conceptual schemas, were used within the database community to
model enterprise data and help solve data integration problems. As the Internet and World-
Wide Web emerged, the Semantic Web community produced standards like RDF and OWL
targeted at organizing and connecting the vast amounts of data being hosted on the Web.
And for many years both academic and business groups have worked on various formal
knowledge models or ontologies, including specific domain models, core or foundational
ontologies (sometimes called upper-level or top-level ontologies), and even some
commonsense ontologies, for example Cyc/Open Cyc.
4

One way of thinking about formal knowledge models is that they are intended to be sort ofOne way of thinking about formal knowledge models is that they are intended to be sort of
like the Fortress of Solitude from Superman, where all the knowledge of the universe is
stored in ice crystals. Only in this case the knowledge is encoded using enterprise modeling
tools, content management systems or various types of database management systems.
Unfortunately, the formal knowledge models created prior to the rise of ML-based AI never
quite matched the Fortress of Solitude. But on the plus side, if the modeling process is
thorough and the models are done well, these models explicitly represent the real-world
knowledge of domain experts in a way that is both formally rigorous and machine-readable.
In fact, since the models are hand crafted usually by expert knowledge modelers working
with domain subject matter experts, and particularly if they are developed by community
groups in a collaborative way, they are often very good representations of the concepts,
facts and rules about the domain. Standards-based models can be shared among groups
and connected to, or integrated with, other related models. So with all this upside, what
happened? Why didn’t this approach work?
5

Despite the many pluses, there were unfortunately a lot of minuses, too. First andDespite the many pluses, there were unfortunately a lot of minuses, too. First and
foremost, formal knowledge modeling is simply very hard to do and because of that, and
the dependency on humans (even if that means groups of humans), this approach is not
easily scalable. There were endless turf wars and format wars, including among the
standards communities. Also, it can be difficult to use interim work products, because they
may not be complete enough in all the necessary areas you need for your application or the
models may be revised as a result of fleshing out the details. So unfortunately the models
were often not Done until they were 100% complete, or at least the perceived risk was high
enough to discourage their interim use. Related to that, it’s hard to gauge what constitutes
Done and to know when to stop. So some of the efforts seemed to never come to an end.
Even with a machine-interpretable format, making the models accessible to applications
was problematic. It was hard to work directly with formats like RDF triples, and there were
few if any triple stores, graph databases or other semantic network databases, at least
none that were performant and scalable enough, so figuring out how to store the models
and expose them via efficient APIs or other access mechanisms was a challenge. And lastly,
if it wasn’t hard enough to build the models once, the effort and cost of maintaining them
by hand over time, at least for any domain that wasn’t largely static, was often prohibitive.
6

One of the big attractions then of using a statistical-based approach that can be automatedOne of the big attractions then of using a statistical-based approach that can be automated
in ML algorithms is that it’s much easier to do. Some people think it is almost like having an
easy button. It’s much more ad hoc than the formal ontology-building approach and the
process goes much faster. All you need to do is apply some math to data and you have AI.
There’s no major need for experts (although some may be needed to QA the results) and
best of all, there’s none of that tedious formal modeling. The work products are relatively
dynamic and easy to maintain in the sense that as things change and new data becomes
available you can rerun the process again and produce a new output, or you can even
stream the data and apply updates continuously through background processing. You also
don’t need to necessarily make the models explicit. You can just use the statistical patterns
to organize the results, for example through indexing. The indices serve as a sort of de facto
model -- think about the indices behind Google Page Rank as a model of how the contents
of various web pages are related to one another -- and that de facto model may be ‘good
enough’, depending upon the use case such that you don’t need a formal model.
7

No wonder this approach replaced the constructive one, right? But unfortunately machineNo wonder this approach replaced the constructive one, right? But unfortunately machine
learning is not a panacea either. Don’t be fooled into thinking that statistical patterns or the
indices created from them are the same as formal knowledge models. They aren’t. Most
ML-based AI systems are ‘dumb’; that is to say they have no real access to knowledge about
the domain they are operating in. They are basically just ‘doing the math’. For some use
cases, particularly in domains like healthcare or finance, this ‘good enough’ approach often
isn’t. You need more formal models with semantic rigor. So there’s a risk that automatically
defaulting to models based on machine learning will end up being mismatched or
misapplied, based on the level of precision needed for the specific use case. At a minimum,
this needs to be deliberately considered when deciding what approach to use.
There are also major problems with not having explicit models: if the patterns aren’t
explicitly modeled, that makes it very difficult to do QA on the results of the ML process. Of
course you can run known test data through the application and see if you get the desired
results. But you can’t really test for the flipside -- the possibility of undesirable results.
There are many well-known horror stories about some of the embarrassing results Google
and others have unintentionally produced using machine learning algorithms. It also makes
it hard to do back-tracking or traceability of results to underlying rationale in the data
model. If you’re developing a physician’s assistant application and you can’t backtrack to
the basis for a patient’s diagnoses and the evidentiary source data to support that, you’re
probably not going to have your application accepted for use in the medical community.
And lack of traceability can also be a problem given the new General Data Protection
Regulation (GDPR) that just came into effect in the EU.
8

While people may think this approach is easy to do, it is more accurate to say that it is relatively easier than
formal knowledge modeling, athough as noted you don’t necessarily get an equivalent result. But even
leaving that aside, ML-based projects are no cakewalk either, and that’s maybe one of the biggest fallaciesleaving that aside, ML-based projects are no cakewalk either, and that’s maybe one of the biggest fallacies
about them. Not only do you have to collect and manage large amounts of data, but a lot of work is
required to prepare the data for use in processing. A data engineering platform company called
Astronomer.io says that 80% of data scientists’ time is spent identifying the appropriate data sets and
preparing the data for processing. Sometimes there is too much data – call it the ‘garage or attic or
basement problem’. Many companies are collecting as much data as they can on the assumption that one
day they’ll find a use for it. Some of it may simply be bad data, or disorganized or at least not well
documented. For training of supervised learning algorithms in specific, data has to be not only clean, but
annotated with at least some guiding metadata. In other cases, particularly for small and medium
businesses or new starts, there may not be any data, at least not to start with -- so you get the ‘cold start
problem’ of building up your data before you can really do anything. And sometimes the biggest and best
data sources are owned by major players like Facebook, Google and Amazon, so the data is essentially
‘locked up’ unless they decide to sell it to you or open it up to the public. And as we know from recent
events, making such data available to 3rd parties can pose its own problems. Lastly, sometimes data
collected with a given purpose in mind is not representative for another intended use or contains hidden
biases that may affect results.
Apart from the challenges with the data, there are challenges with the ML algorithms, too. There are so
many these days it can be hard to know which one or ones to use for your particular use case. A year or two
ago it was common for people to do a lot of trial and error, but there are more examples being shared now
of successful approaches and algorithm suppliers are getting better at recommending which of their
algorithms to use for which types of uses. Some are even trying to use AI to recommend the best algorithm
to use given the data set and use case. In terms of suppliers of algorithms, there are a lot of commercial
and open source algorithms available today, but there’s often still the question of whether any are well-
suited to your specific needs or whether there’s business value to your company in developing your own
custom algorithm versus using the same ones your competition might be using. The algorithms themselves
can also be quite complex. Between challenges with the data and the algorithms, it’s pretty clear why we
need Data Scientists to do this. work.
9

You may have noticed that the strengths and weaknesses of the statistical ML approach and the formal semantic or
ontological approach are in a lot of ways natural complements of one another. So it makes sense to try to combine the
two approaches in a way that plays to their respective strengths and mitigates or eliminates their weaknesses. In fact,
the idea of a hybrid or blended approach isn’t really a new idea -- some of us were talking about it back in 2010 or 11.the idea of a hybrid or blended approach isn’t really a new idea -- some of us were talking about it back in 2010 or 11.
It’s just kind of been forgotten in the tidal wave of uptake of machine learning. But if you look inside the machine
learning process itself, there are often hooks for semantic metadata. It is possible to induce or identify statistical
patterns without understanding what they mean. But to make use of those patterns, as noted earlier, training data sets
are often created that are annotated with some level of semantic metadata, such as labeling images for object
recognition systems. You can find patterns of similar objects that have 4 legs, whiskers and pointy ears, but unless some
training image tells you that’s a cat, or there’s text mixed with the images and you can correlate the appearance of that
object to the appearance of the word Cat, you don’t know what object you’ve identified. Annotation is a partial step, but
richer semantics are needed, both for image analysis and text analysis. Knowledge graphs are a bigger step in the
semantic direction.
The major AI players -- the FAANG/FAMGA companies -- all have or are developing knowledge models or knowledge
graphs to use along with their ML algorithms. To Google’s credit, they started down this path early on. In 2010 they
bought Metaweb and used its Freebase knowledge base as the starting point for what we now know as the Google
Knowledge Graph, which in turn since the Hummingbird algorithm release in 2013 has made Google searches into
semantic searches, well beyond Page Rank’s original link-based algorithm. Google was also an early and strong
proponent of using Schema.org metadata to annotate web pages. Microsoft has also had versions of its Satori
knowledge graph sitting behind Bing for several years and it’s now used across the product suite, including by Cortana.
Facebook famously or infamously launched its entity-based Graph Search back in 2013 to add semantics to Facebook
content, but it struggled out of the gate. It still appears to exist behind the scenes and there are rumors that it is being
re-invigorated to support M as well as another virtual assistant that Facebook is purportedly working on. Netflix
developed a quite rigorous entertainment content classification scheme using human curators, which was initially rolled
out around the end of 2013 and which is now at least semi-automated using ML. Apple is a relative latecomer, as they
are just now developing a knowledge graph for Siri, based on a combination of induction and curation. It’s a challenge
for them because they don’t have a search engine like Google or Bing to collect lots of user data to mine, but they do
have a lot of data from Siri questions. Lastly is Amazon. They have a lot of product and consumer purchasing data, and
some data now from Alexa interactions, but at least up until recently there wasn’t a knowledge graph behind Alexa.
There is some technology called True Knowledge that helps with NLP parsing of questions, but if you’ve ever tried to find
a particular Skill out of the 10s of thousands of Skills Alexa currently has, you know Amazon could benefit from having a
knowledge graph. And it’s a pretty sure bet they are working on one.
10

Unless you work for one of the FAANGs or FAMGAs you might be asking if the hybridUnless you work for one of the FAANGs or FAMGAs you might be asking if the hybrid
approach is relevant to you, or if it’s just for the big players like them. Part of the reason for
this presentation is to emphasize that it isn’t -- or shouldn’t be -- just for them. You can use
this approach, too, using the data you have -- which is probably highly-targeted to your
specific customers and use cases -- augmented by open or public data sets and tools.
So what are some practical ways to execute a hybrid approach? First, it is important to be
like Goldilocks or Einstein when you are doing this. In other words, you really need to find
the sweet spot at the intersection of effort and complexity relative to benefits. This
presentation is focused mostly on text analysis. So let’s start by considering all the various
aspects of statistical text analysis. That includes things like terms and term frequency, entity
and named entity extraction, document clustering, topic identification (sometimes called
topic modeling) and document classification. Text analysis is based on one simple principle:
you can tell a lot about a document by the words that appear in it.
11

To start this process you really only want to focus on semantically significant words. So theTo start this process you really only want to focus on semantically significant words. So the
first step is to get rid of the insignificant words -- many purely syntactical -- that create a lot
of unnecessary noise. Those are often called stop words. Typically 100-200 stop words is a
good rule of thumb. It’s pretty easy to find candidate sets of ‘standard’, commonly used
grammatical or syntactical stop words on the Web. But you may also want to consider
words that are specific to your domain of interest, but that appear so commonly in your
documents that they effectively become ‘noise’, almost like stop words, for example terms
like Health or Medical in the healthcare domain. You do have to exercise caution here
though, as some of those common terms are used as qualifiers or modifiers of other terms
in compound words and phrases, so they provide a sort of ‘connective tissue’ that can be
important. If you find a few of these words getting in the way of meaningful analysis, you
can always add them to your stop word list.
12

For Step 2, run a statistical clustering algorithm on the documents in your data set to groupFor Step 2, run a statistical clustering algorithm on the documents in your data set to group
them naturally based on the similarity of what they are about. This is a good way to check
for overall coherence of documents within your data set, paying attention both to major
clusters as well as outliers. Note that a document or group of documents that sticks out
from the others like a sore thumb may not belong in the same data set. So there may be an
opportunity for some data cleaning. If the documents in the data set are all very similar,
doing document clustering based just on similarity alone may not be so useful. You may
need to employ a different statistical algorithm like TF-IDF (term frequency-inverse
document frequency) which is intended to differentiate the documents by emphasizing
what each document or group of documents is about that the other documents are NOT
about. In other words, emphasizing uniqueness or rarity instead of just similarity.
13

The clustering in Step 2 was statistical, whereas in Step 3 the objective is to do someThe clustering in Step 2 was statistical, whereas in Step 3 the objective is to do some
analysis of the terms behind the statistics. This can be done across the entire data set,
within major (statistically significant) document clusters, or both. One benefit of this is
another layer of coherence check, in this case adding human-based analysis of the terms to
better understand the semantic basis behind the statistical clustering. It also provides
insight into reasons for the outliers. As a simple example if 90% of the documents clustered
around health conditions and 10% around nutrition, and the focus on your use case was
directly on health conditions, you might be able to eliminate the 10% of the documents
that don’t really relate directly to that topic. But since nutrition, as well as genetic and
environmental factors, can cause or affect health conditions, there might be use cases
related to causes or mitigations where including nutritional-related data might be
important. Common terms usually indicate key topics, but it’s not always that clear cut, so
beware. Just because a term occurs commonly, doesn’t mean that is semantically the key
topic. A document could use the terms Rules or Regulations a lot, but really be
thematically about the concept of compliance. And a document about regulatory matters
or regulations might not use the exact term Regulations a lot, but instead have many
occurrences of related terms like Standards, Requirements, Guidelines, Compliance and
Governing Authority, which only taken as a set of terms and understanding their
relationships would you know the document was about the concept of regulation. So
without having a semantic model -- the other half of the equation we’ll talk about next --
just using statistical term frequencies can sometimes be insufficient or even misleading in
some cases.
14

Having analyzed the most common terms in the largest clusters of documents, in Step 4 we lookHaving analyzed the most common terms in the largest clusters of documents, in Step 4 we look
down deeper and at the edges. We can use TF-IDF to focus on differentiating terms. We can also do
more specific positional analysis to focus on clusters of terms that appear together within say 20
words down to 5 words of one another. We can look at Part of Speech (PoS) relationships for words
that appear very frequently and very closely together. Nouns are certainly a primary focus in
semantic analysis because they represent entities, but it can also be useful to look at the roles
nouns play, as well as verbs, which are important for process or task-oriented analysis. So this
means looking at subject-predicate-object relationships (as in RDF triples, Conceptual Graphs and
other predicate logics) , as well as adjectives and other qualifiers that can be very important for
sentiment analysis use cases. Another interesting thing to look at is term substitutions, where a
group of related terms appear frequently, but one member of the group changes. These
substitutions can indicate important semantic relationships. The changed terms may be synonyms
or variants e.g., “If no physician is available, a nurse practitioner may consult with the patient and
approve renewal of the prescription”, “If no doctor is available, a nurse practitioner may consult
with the patient and approve renewal of the prescription”. But substitutions can also indicate
moving along a generalization/specialization or class/subclass hierarchy, for example “Nurse” might
be used generically in some cases, but in other cases appear as “Licensed Practical Nurse”,
“Registered Nurse” and “Nurse Practitioner”. In that last case, Nurse Practitioner might appear with
the terms Diagnosis and Prescriptions whereas the other two types of nurses don’t, based on the
fact that among nurses only Nurse Practitioners can perform diagnosis or prescribe medication. Or
substitutions can be different instances of a given entity or class, e.g., “Physicians in the state of
Texas are required to update their provider information bi-annually using Form 1358B” versus
“Physicians in the state of California are required to update their information annually using Form
1358B”, where Texas and California are instances of the entity U.S. State. It’s interesting to note
that opposites don’t tend to fall into this pattern, as they aren’t typically surrounded by the same
terms as their antonyms unless on occasion when accompanied by negation.
15

So how do you get the data for use in this kind of analysis? Data analytics-orientedSo how do you get the data for use in this kind of analysis? Data analytics-oriented
programming languages like Python and R have libraries of statistical functions that can be
utilized with document-based data sets, as do ML platforms like Apache Mahout and
Google TensorFlow. There are many of these available, including in the Amazon AWS,
Microsoft Azure and Google clouds. For example there are k-means clustering and term
vector cosine coefficient algorithms that can be applied to cluster documents and terms, as
well as algorithms for topic identification and document classification. The open source
Apache Lucene tool is one of the best tools for producing detailed term vector data for
documents and using that to index the documents. Then associated search tools like
Apache Solr and Elasticsearch can be used to search the documents based on the term
indices. There are commercial versions of these available from mainstream vendors like
IBM with its Retrieve and Rank component in Watson BlueMix, LucidWorks, Elastic.co (the
company behind Elasticsearch), as well as many other vendors who have their own
proprietary text analysis algorithms. Apart from cost differences, it’s a good idea to check
whether you get access to the detailed term vector data for your own algorithmic use and
whether/how well the tools integrate with other complementary tools from other
commercial vendors or open source projects. Complete term vector component data
means for each document and in aggregate for the data set as a whole, and not just the
basic TF-IDF statistics, but the full gamut of term frequency stats, including each occurrence
of each term in the document and its offset, or position within the document (so you can
do the statistical analysis discussed earlier). To do that, you can export the term vector data
from Lucene/Solr and import it directly into Mahout for detailed statistical analysis of term
vectors, or use the exported data set as input to any of the other algorithmic packages.
16

Analyzing the data just by looking at the statistics can be pretty mind-blowing, so it is good to be able to visualize the data.
You might want to look at term clusters or sets of graph vectors depicted in various ways. You can start with simple termYou might want to look at term clusters or sets of graph vectors depicted in various ways. You can start with simple term
clusters depicted using tag clouds or heat maps. There are lots of open source tools to generate tag clouds or heat maps,
although you may need to involve a skilled UX developer to put the pieces together and make things visually meaningful.
There are rigorous ways of analyzing relationships among terms using vector math algorithms from Mahout, R and other
tools, but if you want to do something less rigorous to start you just need to decide on a standard way for calculating this
and apply it uniformly, as it isn’t necessarily exact numbers but relative numbers that are useful for visual semantic
graphing. So for any two terms in a document, you get their affinity or ‘correlation score’, and you can then use that to
drive visualizations. For example, if the term Provider appears 53 times and the term Medicare appears 45 times and they
co-occur 37 times in some given proximity, say within 5 or 10 words, you can use that data to determine how strongly
correlated they are (in a given document, across a cluster of documents or for the entire data set). For this sort of visual
analysis, there’s a great open source graphing tool called Gephi. Again, you’ll probably want to pull in someone with UX
expertise. To use Gephi you would go through the following steps.
1. Get the top N terms for a given document (there’s a Solr term frequency API call that returns this data by
document, but it is also part of the complete term vector component, too, if you get that data by document, since
you’ll need it elsewhere)
2. If you only want entities, run OpenNLP, Stanford NLP or some similar Part of Speech (PoS) analyzer/tokenizer to
identify the terms from the top list that are nouns (entities)
3. Get the complete Solr/Lucene term vector component for the document via the API call (including term
frequencies and positions/offsets for each occurrence in the document)
4. Filter it down to just the entities from Step 2
5. Calculate the term ‘relatedness’ affinities by looking at each pair of terms and their position in the document (for
each occurrence of the terms). You might have to experiment a bit, but reasonably, related terms could be
expected to be within say 5, 10 or 20 positions of each other
6. Prepare the data for input into Gephi. There are lots of input formats, but CSV is a common one. Put the term
pairs in a CSV file as nodes and graph edges using their affinity as the edge weight (see Gephi data input formats).
Term frequency can be used to determine the size of the node and associated node label (term name) and co-
occurrence can be used for the length of the arc or edge between nodes.
17

While Gephi is useful for visualizing graph data, it is a visualization tool, not a graphWhile Gephi is useful for visualizing graph data, it is a visualization tool, not a graph
database. Graph databases create and store data relationships as nodes and edges,
including the graph distance along the edges between the nodes. Some of the open source
and commercial databases can directly import the detailed Lucene term vector component
data to create internal graphs of the term vectors. Some examples of graph databases are
Neo4j, which has both an open source and commercial version, Blazegraph, GraphPath,
Dgraph and at least to some extent also Stardog and Metaphacts GraphScope. The latter
two are more oriented to RDF triples, but may work with Lucene data as well, or you may
be able to convert. One good thing to note is that graph databases often have their own
visualization tools, too, so you don’t need a separate tool for that. One of the most useful
aspects of graph databases is the ability to easily do graph proximity or graph distance
calculations to analyze the ‘relatedness’ of terms, as a way of beginning to get into the
formal space of knowledge graphs, as discussed earlier in the presentation.
18

Another way of getting more formal about the semantic or ontological relationshipsAnother way of getting more formal about the semantic or ontological relationships
underlying the statistical data about terms and their appearance in documents is to
introduce facets. In general, faceting is a way to annotate and organize documents. You can
think of facets as topographical features that form planes in a graph vector space of
document texts. In this case the Solr/Lucene tools support the use of facets by explicitly
allowing you to denote some of the terms from the documents as organizing terms or
facets, basically adding semantic structure to the indices. In Solr/Lucene the facets serve as
secondary indices which are then used to boost search results, so it’s in effect a way of
making Solr searches into more semantic-based searches. The terms you use for faceting
should be key terms such as the most common, the most unique or differentiating, or some
combination of these two. The Solr/Lucene indices, including facets, can be exported and
imported into semantic modeling tools such as Apache Stanbol, where you can then begin
to get even more rigorous about building knowledge models.
19

The relationship between statistical analysis and formal knowledge modeling works both ways. We’ve been looking a lot at
how to use the statistical data as a starting point for more formal knowledge modeling. But if you have formal knowledge
already encoded in some form, there are ways you can use that to augment statistical analyses. Ontologies model the
concepts behind concrete things in the real world as well as abstract concepts. As noted earlier, one of the problems withconcepts behind concrete things in the real world as well as abstract concepts. As noted earlier, one of the problems with
building ontologies is that they can be quite large and complex. Because of that, a lot of knowledge models are taxonomies,
which are a subset of ontologies focused on entities organized into hierarchies or trees. For instance, as we saw in earlier
examples, there are generalization/specialization or class/subclass hierarchies and type/instance or class/instance
hierarchies. Of course these can be represented as graphs, with the entities as the nodes and the hierarchal relationships
as edges. Sometimes the most interesting data you want to analyze and understand actually are so-called long-tail data
patterns involving lattices formed by complex graph intersections, truly giving meaning to term ‘edge cases’. Imagine a
hierarchy structure involving fishing, stream fishing, and mountain stream fishing and then navigating through graph
relationships to connect these concepts to environmental concepts such as the effects of acid rain and runoff/erosion on
mountain stream water quality and its associated biological impact on fish and fishing. Important data may remain buried
or hidden ,or edge cases may go unexplained ,if you’re only looking for statistical patterns in large data sets based on the
most frequent, common items. There are lots of use cases where you want to really dig in and better understand the data
behind statistical models. For example, if you are a lawyer doing M&A due diligence you might want to look for needles in
the haystack that can either raise a red flag for a potentially bad deal or expand business opportunities around a deal
beyond the value generated through more obvious channels or markets. Or think about the value of identifying voters in
Midwestern and Southern swing states who are political independents or ‘blue dog Democrats’ and have filed for
unemployment in the past few years, and understanding how their concerns around jobs, trade and immigration might
affect voting. Sometimes relatively small population segments can make a big difference, outsized to their overall statistical
proportion. So while you need to keep your eyes on the big picture, edge cases are often important, too. You may not even
have an understanding of where those might exist in the data if there isn’t some organizing or guiding metadata to give
context and meaning to the raw data points. You can’t correlate data to produce more complex data points, and even if
you do correlate the data you may not understand the nature of the correlation unless there are some sort of semantics
associated with the data. News articles back in 2008/9 about the conflux of debt aggregation and reselling, foreclosures,
risky derivative markets, etc. might have indicated a red flag if there had been a knowledge model about such connections
and risks, which there hopefully is now. It might also be interesting to understand what things are said about companies in
the press before their stocks rise or fall (predictive sentiment). So knowledge models or ontologies can really give context
and meaning to data. Think of it this way: what if statistical models could talk -- what if they could tell you about the
meaning of the patterns in the data? Semantic models such as ontologies can give a voice to the patterns discovered in
statistical models. They can help data scientists and their business counterparts to work collaboratively, ask meaningful
questions to drive the models, and dig more deeply into the patterns that appear in statistical models.
20

If you have an ontology, even a relatively simple taxonomy, it can be useful in a number ofIf you have an ontology, even a relatively simple taxonomy, it can be useful in a number of
ways to help filter and organize statistical data about text documents, in particular term
vector data. Some enterprises have their own proprietary models that they’ve developed
themselves, sometimes for other uses like organizing digital document libraries or the
content in content management systems. Those can be reused for this purpose. There are
also commercially-available proprietary models. But of more interest perhaps are the many
public models, including lots of Linked Open Data models (LODs) that were produced by the
Semantic Web community from the late 90s up to the mid 20 teens, and even into present
day. According to the University of Mannheim, which helps track the models, there are over
a thousand of them for various domains. There are also many models from business and
academic communities such as The Object Management Group’s (OMG’s) Financial Industry
Business Ontology (FIBO) and the SNOMED healthcare terminology model. These
knowledge models can for example be used to guide and validate the clustering of
documents or terms from a data set. They can also be used to identify which terms are key
terms for populating in as facets in tools like Solr/Lucene.
21

Even if there isn’t an existing ontology, there are some tools and techniques that can helpEven if there isn’t an existing ontology, there are some tools and techniques that can help
you get at least part of the way there. Primal is a startup that provides a commercial
RESTful API that generates or synthesizes a taxonomy of related terms based on Yago
(which is a machine-readable form of Wikipedia data), WordNet and Wiktionary. It returns
the results as structured JSON for any given set of terms or topics, including long-tail topics.
Alternatively, you can use an approach called emergent semantics to expose over time the
semantics in various textual data sets by analyzing not just the text itself, but existing
metadata associated with documents in the data set, such as data types, data sources,
application usage data, time, geo-location and other contextual data, tags, keywords and
other document markup. Think of it as semantics mining done continuously in the
background to build models of your data and associated relationships.
22

Whether you start with an existing ontology or initially build one based on the output ofWhether you start with an existing ontology or initially build one based on the output of
statistical analyses, you can use various semantic tools and platforms to refine those
ontologies and use them again in the next iteration through the process. This closed-loop
iteration can be very effective over several cycles. There are lots of open source and
commercial semantic tools and platforms. For NLP and extraction of entities/named
entities there is the open source Apache Stanbol semantic platform, as well as the Watson
Natural Language Understanding Platform, part of IBM Bluemix, that does similar things
from a commercial product standpoint. Then there are tools that add annotation and
knowledge graphing. Ambiverse, a commercial startup spun-off from the Max Planck
Institute for Informatics and founded by the team that developed Yago, extracts named
entities, annotates them with Yago-encoded knowledge and generates an associated
knowledge graph. And to go all-in there are full ontology building tools like Stanford’s
Protégé tool, which is open source, as well as many similar commercial products such as
The Semantic Web Company’s PoolParty Semantic Suite.
23

Regardless of the tools you use for ML-based statistical text analysis and ontologyRegardless of the tools you use for ML-based statistical text analysis and ontology
modeling, you’ll need to use and extend your existing data governance policies, processes
and tools, where necessary, to help you organize, manage and secure the source data sets,
knowledge models used as part of the process, and the resulting ‘smart data’ outputs, as
part of your overall data governance strategy. Again, in the EU at least, GDPR comes into
play here, too. It’s a pretty safe bet to anticipate that there will be myriad new challenges --
many of which we haven’t even identified yet -- around things like privacy, the accuracy
and/or biases inherent in this data, new uses of smart data, etc. So my recommendation is
to give data governance a seat at the table in these activities from day one.
24

In conclusion, what are the benefits of this hybrid or blended approach and why is itIn conclusion, what are the benefits of this hybrid or blended approach and why is it
important? On the one hand, we get improved relevance -- truly smart data that’s better,
more traceable and more explainable in terms of the rationale for decision-making than we
would otherwise get from purely ML-based statistical approaches alone. That’s because the
addition of semantics provides better validation and more context or rationale behind the
data. As noted before, that’s not just useful to have, but may be required in some cases,
such as for the EU’s General Data Protection Regulation (GDPR). On the other hand, we can
create knowledge models and produce smart data outputs much faster and more
dynamically than if we used manually constructed ontological approaches alone. The ML-
based statistical data drives the rapid development and extension of semantic models such
as ontologies. This hybrid or blended approach will be critical for future intelligent systems
where smart data serves as the basis for predictions and decision-making, in addition to
what’s common today for informational purposes, recommendations, education or
entertainment. This will be even more important if future task-based intelligent systems
take proactive actions autonomously based on those decisions. In other words, the hybrid
approach will not only improve the building of today’s assistive or augmentative intelligent
systems, but also lay the groundwork for the next generation of even smarter, cognitive
computing systems.
25

Hybrid use of machine learning and ontology

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Hybrid use of machine learning and ontology

Ähnlich wie Hybrid use of machine learning and ontology (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Hybrid use of machine learning and ontology