1. Essential Engineering Intelligence
Adventures in Constructing an
Engineering Domain Model
What the IET learned when we decided to ‘semantically enrich’ our own and others’ data and prototype some concept products…
May 2014
Alison Haggar, Product Manager
SSP Annual Meeting Session [Preparing for Tomorrows Stakeholders
Today: Wear it, Map it, Augment it!]
2. Essential Engineering Intelligence
What we did
Sector coverage: Renewable energy
Market coverage: Worldwide for research trends app, UK only for other apps
4 x use cases: Academic Researcher, Product Development Engineer, Product Manager, Consultant
4 x work flows: One per use case
4 x applications: One per use case
IET data feeds: Inspec
3rd Party data feeds: Emerald Group Publishing Limited, RenewableUK, Ofgem, CambridgeIP
Data model: Based on Inspec thesaurus (controlled and uncontrolled terms)
4. Essential Engineering Intelligence
A Domain Model for
Engineering
[Diagram: domain model. Node labels: People, Organisations, Inspec, Wind Turbines. Attribute labels: Email, Job title, Address, Name, website, Name. Relationship labels: ‘Develop / are developed by’, ‘Employ / work for’, ‘Specialise in / are developed by’.]
9. Essential Engineering Intelligence
Visualisation: Trend analysis
Q. Has research into ‘solar
absorber-convertors’ peaked
or is it still growing?
Q. Has research into areas
closely related to ‘solar
absorber-convertors’ peaked
or is it still growing?
10. Essential Engineering Intelligence
The Renewables
Directory
Companies categorised as developers, manufacturers, installers and/or suppliers
Filters appropriate to the technology
Social media
12. Essential Engineering Intelligence
Not only products… but Services
‘Self service’ enrichment modules
A hosted enrichment service
Client data analytics
Engineering Market analytics
Knowledge Hubs
13. Essential Engineering Intelligence
Publisher Applications
Get statistical support for your content acquisition strategy by:
Identifying influential and emerging research trends
Visualising journal strengths and weaknesses over time
Comparing one journal to another
Comparing your publications to other publishers’ content to identify
USPs
Find and contact potential peer reviewers:
By specialism, geography and influence + continue to monitor
suitability
Keep your authors up-to-date:
Real-time notification of new citations + annual citation statements
Increase your citation statistics by:
Suggesting possible citations to researchers
14. Essential Engineering Intelligence
Final thoughts
At the IET we believe that
1. The future of indexing is Expert Curated Domain
Models
“The future of data curation is a competition between
information graphs” Sayeed Choudhury, Associate Dean for
Research Data management, Johns Hopkins University
2. The IET Engineering Domain Model is a
massively powerful tool
“Your data must out-perform the sum of its parts! And
produce solutions – not just more questions” D R Worlock,
Digital Strategy Advisor and Consultant
15. Essential Engineering Intelligence
Be Part of our Journey
Product Manager: Alison Haggar
ahaggar@theiet.org +44 1438 765611
“If you have engineering content then we can help
you to discover its hidden potential”
Editor's notes
Good morning, as it says on the slide, my name is Alison Haggar and I’m here today to tell you about an adventure we’ve been on at the IET for some time now.
I joined the IET’s Knowledge division towards the back end of 2012 with a brief to look at new ways of presenting knowledge to both our current customers and to potential new customers.
The team had already done quite a bit of research and we therefore already knew that what users wanted were detailed answers to specific questions rather than links to lists of documents containing information that might or might not answer their questions. And they wanted to achieve this using an interface that had as near to Google simplicity as possible. We also had quite a lot of content – 14 million + records in Inspec alone.
The next question was how to marry the two without creating ‘just another website’. Very quickly, therefore, I focussed on natural language processing and semantic enrichment technologies.
Now the ultimate goal was to be able to answer questions across the whole field of engineering and technology – but I felt that this breadth of scope was partly the reason for a lack of progress to date. We needed to focus in on a small area of engineering, build something, play with some data, answer some specific questions and then work out how to scale this up to a wider audience.
So we reduced the scope to renewable energy and focussed on content providers who were geographically close and therefore easy to work with face-to-face. We also selected 4 representative user types: academic researchers and product development engineers, who fall within our current customer base, and product managers and consultants, who don’t. We then carried out a series of in-depth interviews with representatives of these users to find out what information they needed to support them in their roles and which bits they currently find difficult to source.
The research was used to spec up a prototype with some very specific goals:
The first being to test natural language processing and semantic enrichment techniques against manually curated data to see how well the reality actually lives up to the hype
Secondly, to investigate data storage, the building of an API and a re-usable front end to help us assess production costs for a commercial implementation
Thirdly, to investigate what mix of content we would need to answer the questions our research had thrown up, how easy that content would be to source and what it would cost.
And last but certainly not least, how best to make the products and services uncovered during the prototyping process available to end users.
And this is a screen shot of the prototype we built. Specifically, it shows a UI that allows people to select the market vertical they work in and then their job role. Once selected, these criteria are used to both interpret search results and to provide answers to the following questions:
Who is carrying out research into technology x (and related fields)?
Is research into technology x (and related fields) increasing or decreasing?
What companies specialise in technology x near location y?
What companies near location y operate/develop/manufacture/install technology x?
Is location y suitable for installing renewable technology x?
Can you show me information about operating renewable technology x installations near to location y?
In addition to answering these questions we also created profiles for all the people and organisation entities we extracted during the enrichment process (about 1 million people and 100k organisations).
I’ll show you some screen shots of the prototype a bit later in my presentation but first I’d like to return to some of the underlying work we had to do in order to be able to answer the questions I just listed.
The very first thing was to look at our data. Did we have the information we needed within Inspec to answer these questions and was it held in the right format to make this possible?
The answer in both cases was no.
In addition to Inspec data we partnered with Emerald Group Publishing Limited, RenewableUK (a UK based renewable energy trade association specialising in wind and wave), Ofgem (the UK independent National Regulatory Authority for electricity and gas) and CambridgeIP (a UK based IP consultancy) who all provided data to us free of charge for prototyping purposes.
We also needed to model this data in the way shown on the slide. We did this in conjunction with our development partners, Ontotext AD and 67 Bricks.
We had a bit of a head start because we already own an engineering and technology taxonomy, the Inspec Thesaurus, which contains around 20,000 terms, manually compiled by subject matter experts over the last 40 years. But we also knew that to do what we wanted to do with the prototype, we needed to model a much broader set of concepts and relationships including people, organisations, places, activities, products and publications.
And each of these entities or sets of things would need to be subdivided into subsets and subsets of subsets until we reached a level of granularity where each entity within a set had the same attributes and relationships associated with them.
Just to give you an idea of how powerful this way of arranging your data is, the simple structure on this slide would allow a system to answer quite complex questions like: “Can you give me the names and email addresses of people working for organisations who develop wind turbines?”
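As a minimal sketch of how a structure like this can answer that question, the Python below stores the slide's relationships as simple (subject, relationship, object) triples and walks them. All names, email addresses and identifiers are hypothetical placeholders, and the in-memory list is only an assumption for illustration; it is not how the prototype stores its data.

```python
# Hypothetical toy data: none of these people or organisations come from the prototype.
people = {
    "p1": {"name": "A. Example", "email": "a.example@example.org"},
    "p2": {"name": "B. Sample", "email": "b.sample@example.org"},
    "p3": {"name": "C. Placeholder", "email": "c.placeholder@example.org"},
}

# (subject, relationship, object) triples mirroring the relationships on the slide.
triples = [
    ("p1", "works_for", "org1"),
    ("p2", "works_for", "org2"),
    ("p3", "works_for", "org1"),
    ("org1", "develops", "wind turbines"),
    ("org2", "develops", "solar panels"),
]


def subjects(relationship, obj):
    """Return every subject linked to obj by the given relationship."""
    return [s for s, r, o in triples if r == relationship and o == obj]


def contacts_for_technology(technology):
    """Names and emails of people working for organisations that develop a technology."""
    developers = set(subjects("develops", technology))
    contacts = []
    for person_id, relationship, organisation in triples:
        if relationship == "works_for" and organisation in developers:
            person = people[person_id]
            contacts.append((person["name"], person["email"]))
    return sorted(contacts)


print(contacts_for_technology("wind turbines"))
```

The point of the sketch is that the question becomes two hops across typed relationships (technology to organisation, organisation to person) rather than a keyword search.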
An example closer to home requires another set of entities, namely publications. This would then allow us to answer questions like: “Who were the top contributors in the field of wind turbine motor research between 2012 and 2013?” or “Is wind turbine motor research a growth area?”. If we then add patent data into the mix we can start to answer even more complex questions like “Wind turbine research was identified as a research growth area in 2010; how many patents have been filed since then?” In other words, we can use our engineering domain model to assess how quickly academic research leads to product innovation.
So let’s go back to the prototype and see some of the questions and answers in context.
This slide shows the answer to the first question in the list: Who is carrying out research into technology x (and related fields)?
Because of the entities we chose to model and the relationships we made between them we were able to rank authors both in terms of number of publications over time as well as by number of citations.
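A rough sketch of that kind of ranking, assuming each publication record carries a list of authors, a year and a citation count. The field names and sample records are hypothetical and are not taken from Inspec.

```python
from collections import Counter

# Hypothetical publication records, for illustration only.
publications = [
    {"authors": ["Author A", "Author B"], "year": 2012, "citations": 10},
    {"authors": ["Author A"], "year": 2013, "citations": 2},
    {"authors": ["Author B"], "year": 2013, "citations": 5},
    {"authors": ["Author C"], "year": 2011, "citations": 50},
    {"authors": ["Author A"], "year": 2013, "citations": 0},
]


def rank_authors(records, first_year, last_year):
    """Rank authors by publication count and by citations within a year range."""
    publication_counts = Counter()
    citation_counts = Counter()
    for record in records:
        if first_year <= record["year"] <= last_year:
            for author in record["authors"]:
                publication_counts[author] += 1
                citation_counts[author] += record["citations"]

    def ordered(counter):
        # Highest count first; ties broken alphabetically so the order is stable.
        return sorted(counter.items(), key=lambda item: (-item[1], item[0]))

    return ordered(publication_counts), ordered(citation_counts)


by_publications, by_citations = rank_authors(publications, 2012, 2013)
print(by_publications)
print(by_citations)
```

With this sample data the two rankings disagree: one author publishes more in the period, the other is cited more, which is why it is useful to offer both views.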
In addition, we used the Geonames API to enable us to plot the location of authors based on the address of the organisation they were most recently affiliated to.
We also pulled in tweets based on the search term as well as creating profiles for the people and organisations we identified.
And here’s an example of a profile:
We have provided this gentleman’s email address so he can be contacted
We have identified those areas he specialises in
And we have provided links to individuals associated with the organisation he is most recently linked to as well as a list of organisations he himself has been associated with over time.
On the second tab we have provided abstracts for all his publications as well as an individual tag cloud.
And on the third tab we have linked to patent abstracts where he is named as one of the authors.
Because of the way the data is stored, every reference to an author or an organisation is also a link to their profile.
We are still exploring how to extend person profiles to include social media and news, but disambiguating people, and especially linking people to their Twitter feeds, is a hard problem that requires manual input and QA.
Profiles of organisations were a little easier to disambiguate and so have more detail in them.
For example, we were able to provide additional contact information, company logos and descriptions as well as links to social media and news information scraped from the companies’ websites.
We also investigated the possibility of listing similar organisations based on groups of tags – results were mixed and this area of functionality still needs further work!
Finding lists of products was also hard but we built a scraping tool to help us and did some manual QA and editing on the results.
Finally, we included some third party data sets by linking to the Open Corporates API for company and financial data and to a number of sources for patent data including CambridgeIP, American open source patent data and the European patent office API.
The main learning points for us were that while it is possible to create profiles based on Inspec data these are very limited and to make them really useful you need to combine this data with many other data sources as well. You also need to manually complete profiles and to regularly review and update them.
Moving on, the second question we wanted to answer was: Is research into technology x (and related fields) increasing or decreasing?
As you can see on the slide we provided this for individual technical concepts and then also for those concepts related to the original by the various relationships modelled in the Inspec Thesaurus.
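The idea can be sketched as counting publications per year for a concept, then repeating the count with its related terms included. Everything below is an assumption for illustration: the term names, the related-term mapping and the crude "compare the last two years" rule are hypothetical and do not reproduce the Inspec Thesaurus or the prototype's analysis.

```python
from collections import Counter

# Hypothetical related-term mapping standing in for thesaurus relationships.
related_terms = {"technology x": ["related term a", "related term b"]}

# Hypothetical (year, indexing terms) records.
records = [
    (2010, {"technology x"}),
    (2011, {"technology x"}),
    (2011, {"technology x", "related term a"}),
    (2012, {"technology x"}),
    (2012, {"related term a"}),
    (2012, {"related term b"}),
    (2013, {"related term a"}),
    (2013, {"related term a"}),
    (2013, {"related term b"}),
    (2013, {"related term b"}),
]


def with_related(term):
    """The term itself plus any terms related to it in the mapping."""
    return {term, *related_terms.get(term, [])}


def yearly_counts(records, terms):
    """Number of records per year indexed with at least one of the terms."""
    counts = Counter(year for year, indexed in records if indexed & terms)
    return dict(sorted(counts.items()))


def trend(counts):
    """Crude indicator: compare the latest year with the one before it."""
    years = sorted(counts)
    if len(years) < 2:
        return "insufficient data"
    latest, previous = counts[years[-1]], counts[years[-2]]
    if latest > previous:
        return "increasing"
    if latest < previous:
        return "decreasing"
    return "flat"


narrow = yearly_counts(records, {"technology x"})
broad = yearly_counts(records, with_related("technology x"))
print(narrow, trend(narrow))
print(broad, trend(broad))
```

In this toy data the narrow concept appears to have peaked while the wider related area is still growing, which is the distinction the two questions on the slide are asking about.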
Once again, we produced a list of publication abstracts with links to the full text and to the profiles of individuals and organisations identified in the abstracts.
Because we had decided to focus on renewable energy our next application is more obviously focussed in this area.
It was built to answer two questions: What companies specialise in technology x near location y? and what companies near location y operate/develop/manufacture or install technology x?
The IET already has a directory product, E&T Marketplace, which is compiled in a more traditional manner. We were keen to explore what could be done by a combination of web scraping and semantic enrichment techniques to reduce production costs and add functionality.
As you might expect, the results weren’t clear cut. We found identifying relevant companies was easy, but extracting product information was hard.
We were able to pull out interesting and relevant filters, identify relevant social media and news feeds and incorporate 3rd party data sets but the information we extracted again needed some manual curation and would obviously require regular updating to remain relevant.
And this is the third application we built to answer the last two questions on my original list: is location y suitable for installing renewable technology x? and can you show me information about operating renewable technology x installations near to location y?
This was the one that we developed in conjunction with RenewableUK, a trade association specialising in supporting wind energy technologies. The app is an interactive version of a pdf brochure they currently publish on their website.
Once again, we incorporated a number of third party data sets including open data published by the UK government on average wind speeds per square kilometre for the entire UK.
The idea here was to allow users to see if there was any point in considering installing a wind turbine at their address. If it was a possibility, then to help them hire a low cost wind gauge to check the reality themselves and then, if the results over three months were promising, to put them in touch with a local installer.
We also used more open data to create a searchable database of all UK wind turbine installations including output statistics and information on the operators and owners. Contact details are provided to enable individuals or organisations considering installing wind turbines to get in touch with existing owners for advice on planning (which can be fraught), tariffs and so on.
Once we’d completed the prototype we held workshops with all the organisations that had provided us with content.
The aim of these workshops was to review the prototype and to gauge their interest in creating a production version.
What was really interesting was that, while they were interested in doing this, they were more interested in the technology that had produced these applications and how it might help them to understand their own content and the markets that they operate in.
This has led us to consider providing a number of services in advance of producing full blown knowledge hubs.
A pre-trained enrichment module based on the engineering domain model we are developing from our class-leading Inspec thesaurus
A hosted version of this service so you don’t even need to do the QA
A set of content analytics applied to your data
And a further set of engineering market analytics enabling you to compare your content to your competitors’ content
In the latter two cases you don’t even have to understand anything about semantic enrichment in order to benefit from it and you don’t need to re-organise the way you store your data. We would take care of that for you.
We believe that these services and the combination of automated enrichment and manual QA and editorial services would allow you as publishers to improve your content acquisition strategies, to develop your peer review networks, to provide real-time notifications to authors and even to increase the number of citations of your publications by proactively suggesting citation opportunities to researchers in relevant communities.
A few final thoughts then.
Here at the IET we are really excited about what is possible when you mix semantic enrichment software with a domain model for engineering and an existing indexing and QA service.
The work we have done on our prototype suggests that this framework applied to a breadth of content types and sources really can create truly linked data which will make a real difference to end users be they academic researchers, product managers, practising engineers or consultants or indeed, anyone else working in engineering who has a question that needs answering.
My final message to you all, therefore, is this: if you have engineering data and would like to discover more about what it could tell you, either in isolation or in conjunction with the wider Inspec engineering Universe then PLEASE get in touch with either myself, David Smith or Daniel Smith. We would love you to be part of our journey.