This document discusses how natural language processing (NLP) and semantics can help address challenges in four domains: monitoring violations against journalists, disaster relief, scientometrics, and the "B-word" (Brexit). It provides examples of how NLP tools can extract and categorise information, link entities and events, geotag social media posts, and connect different data sources to provide a richer understanding of knowledge production. Semantic technologies such as ontologies are presented as a way to coherently connect topics across document types and data sources.
Adding value to NLP: a little semantics goes a long way
1. Adding value to NLP: a little semantics goes a long way
Dr. Diana Maynard
University of Sheffield, UK
2. 15 years ago in NLP, we were…
• Using ontologies for adding reasoning/additional information to text
• Developing linguistic tools for conceptual modelling
• Semantic enrichment for ontology mapping
• Developing ways to get from text to ontologies and back
• Trying to persuade the Semantic Web community that NLP was important
3. NLP at ESWS 2004
• Paul Buitelaar, Daniel Olejnik, Michael Sintek: A Protégé Plug-In for Ontology Extraction from Text Based on Linguistic Analysis
• Fabio Ciravegna, Sam Chapman, Alexiei Dingli, Yorick Wilks: Learning to Harvest Information for the Semantic Web
• Sanghee Kim, Paul H. Lewis, Kirk Martinez, Simon Goodall: Question Answering Towards Automatic Augmentations of Ontology Instances
• Naoki Sugiura, Yoshihiro Shigeta, Naoki Fukuta, Noriaki Izumi, Takahira Yamaguchi: Towards On-the-Fly Ontology Construction - Focusing on Ontology Quality Improvement
• John Domingue, Martin Dzbor, Enrico Motta: Collaborative Semantic Web Browsing with Magpie
4. Mapping semantic relations
• Finding “hidden” semantic links between concepts using gloss information from dictionaries
• (We want to know that Henry Ford makes cars)
• These days we could just check Wikipedia :)
8. Many of the underlying technologies are still relevant
• The techniques may have changed a bit: we now have
  • lots of lovely big data to play with
  • open source knowledge bases
  • powerful GPUs (crack cocaine for NLP students)
  • deep learning (all the cool kidz :) )
• But there are a number of fields where people still haven’t even heard of the word “ontology” (or at least, they don’t know what it means)
9. How to impress a “freedom of the media” researcher
“What is this wizardry?”
10. The problem: too much unstructured information
Feral databases
On 26 June 1996, while driving her red Opel Calibra, Guerin stopped at a red traffic light on the Naas Dual Carriageway near Newlands Cross, on the outskirts of Dublin, unaware she was being followed. She was shot six times, by one of two men sitting on a motorcycle.
The solution: categorise it and stick it in a spreadsheet
“Everybody loves spreadsheets!”
How many journalists died in 1996?
11. That’s great, we have lots of NLP tools to do that!
Name | Date | Location: City | Location: Country | Event type | Type of death
Guerin | 26061996 | Dublin | Ireland | Killing | Shooting
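As a minimal illustration of that extraction step (the deck's own pipelines are built with GATE; spaCy is used here purely as a stand-in, and the field mapping is a simplification):

# Minimal sketch: map named-entity output onto the spreadsheet columns above.
# Assumes spaCy and its small English model are installed; spaCy is a stand-in for GATE.
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("On 26 June 1996, while driving her red Opel Calibra, Guerin stopped at a red "
        "traffic light on the Naas Dual Carriageway near Newlands Cross, on the outskirts "
        "of Dublin, unaware she was being followed. She was shot six times, by one of two "
        "men sitting on a motorcycle.")

record = {"Name": None, "Date": None, "City": None}
for ent in nlp(text).ents:
    if ent.label_ == "PERSON" and record["Name"] is None:
        record["Name"] = ent.text
    elif ent.label_ == "DATE" and record["Date"] is None:
        record["Date"] = ent.text
    elif ent.label_ == "GPE" and record["City"] is None:
        record["City"] = ent.text        # crude: first place mention taken as the city

print(record)   # event type and manner of death need task-specific rules on top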
12. Woe betide anyone who opens the sacred spreadsheet!
Mass panic
Let’s start a new spreadsheet!
Aaaaggghhhh!!!
15. • Violations against journalists
• Disaster relief
• Scientometrics
• The B-word
Case studies: how can NLP and semantics help?
16. Global monitoring of violations against journalists
• The situation: cases of killing, kidnapping, enforced disappearance, arbitrary detention and torture of journalists and associated media personnel, often with impunity
• The task: UN SDG agenda (indicator 16.10.1): we need a comprehensive monitoring system capturing the scope & nature of violations (lethal & non-lethal)
• The problem: the data is a horrible mess!
  • accessibility of reliable information on violations
  • gaps in the data beyond killings; cannot compile data on a wide range of violations; practical challenges of collection
  • systematisation and compiling of information
  • conceptual inconsistency; much information is unstructured
17. Events are complex
The family of murdered Maltese anti-corruption journalist Daphne Caruana Galizia is demanding an independent public inquiry because she had suffered years of intimidation. She was killed by a car bomb near her home in October. Her widely-read blog accused top politicians of corruption. One of her sons, Paul, said three pet dogs were killed and attempts were made to burn down the journalist's home.
• Several separate events, but
  • are they related?
  • was it the same perpetrator(s)?
• Traditional methods use a person-centric approach with no relation between events
18. But we know how to understand complex events
SemEval 2018 Task on Counting Events and Participants in the Long Tail
23. Reconciling information from different sources

Relation | Example | Issue | Solution | Same event?
Exact match | Location: London / Location: London | Identical facts | Two events can be merged | Very likely
Equivalent match | Type: murder / Type: assassination | Semantically equivalent facts | Two events can be merged | Very likely
No-conflict | Location: London / Date: 2019 | Different but semantically compatible facts | Facts from two events can be merged | Probably
Specificity conflict | Location: London / Location: UK | One fact is more specific than the other, but both are semantically compatible | Facts can be merged at either specific or generic level | Probably
Direct conflict | Location: London / Location: Paris | Facts conflict semantically: both cannot be correct | More info needed to resolve conflict | Unlikely
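A minimal sketch of the per-field decision logic behind this table (the relation names come from the table; the equivalence and part-of lookups are toy stand-ins for real semantic resources):

# Decide how two values of the same field (e.g. Location) relate, per the table above.
# EQUIVALENT and PART_OF are toy stand-ins for real lexical/geographic knowledge.
EQUIVALENT = {("murder", "assassination")}
PART_OF = {("London", "UK"), ("Dublin", "Ireland")}

def compare_fact(a, b):
    """Return the relation between two values of the same field."""
    if a == b:
        return "exact match"            # identical facts: merge, very likely same event
    if (a, b) in EQUIVALENT or (b, a) in EQUIVALENT:
        return "equivalent match"       # semantically equivalent: merge
    if (a, b) in PART_OF or (b, a) in PART_OF:
        return "specificity conflict"   # merge at the specific or the generic level
    return "direct conflict"            # both cannot be correct: more info needed

# The "no-conflict" row covers the case where the two records fill different fields,
# so no per-field comparison is needed there.
print(compare_fact("London", "UK"))              # specificity conflict
print(compare_fact("murder", "assassination"))   # equivalent match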
24. Reconciling our information

Source 1:
Name | Date | City | Country | Event type | Type of death
Guerin | 26061996 | Dublin | Ireland | Killing | Shooting

Source 2:
Name | Date | Organisation | Location | Event type
Veronica Guerin | 26 June 1996 | Sunday Independent | Ireland | Murder

Reconciled:
Name | Date | Organisation | Location | Event type | Sources
Veronica Guerin | 26061996 | Sunday Independent | Dublin, Ireland | Shooting: deliberate, fatal | 2
25. Automatic categorization of free text
Entities and events are extracted, along with features, and linked to other knowledge sources, e.g. Wikipedia, other categorization schemes
26. Social Media and Disaster Relief
• Hurricane Sandy in the US:
• 1.1 million tweets in the first day; over 20 million in total
• > 800K photos with #Sandy hashtag on Instagram
• Haze in Singapore: > 23 million
• Nepal earthquake in 2015: more than half a million posts
28. Tools to help disaster victims get aid quickly: pinpointing geographic locations
• Many NGOs are not local to the disaster area and may not have a good grasp of the geography
• Place names are ambiguous
• Find mentions of locations in the text, match them to a knowledge base, and plot them on a map
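A minimal sketch of that last step, with a toy gazetteer in place of a real knowledge base (place names, coordinates and populations below are illustrative only):

# Toy gazetteer lookup: ambiguous place names are resolved by preferring the
# most populous candidate (illustrative entries, not real knowledge-base data).
GAZETTEER = {
    "Kathmandu": [{"country": "Nepal", "lat": 27.71, "lon": 85.32, "pop": 1_000_000}],
    "Springfield": [
        {"country": "USA (IL)", "lat": 39.80, "lon": -89.64, "pop": 114_000},
        {"country": "USA (MO)", "lat": 37.21, "lon": -93.30, "pop": 169_000},
    ],
}

def geotag(text):
    """Return (mention, lat, lon) for each gazetteer entry found in the text."""
    points = []
    for name, candidates in GAZETTEER.items():
        if name.lower() in text.lower():
            best = max(candidates, key=lambda c: c["pop"])   # crude disambiguation
            points.append((name, best["lat"], best["lon"]))
    return points

print(geotag("Aftershocks reported near Kathmandu; aid convoys leaving Springfield."))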
29. Adding semantic information (from BabelNet) improves cross-lingual crisis event classification
CREES Google Sheets Add-on
30. Semantic Technologies in Scientometrics
Opportunities:
• Ability to link different kinds of data sources to provide a richer view of knowledge production in Europe
Challenges:
• Need for a robust approach to identify and model relevant topics
• Language (hard to connect different kinds of data due to terminology differences)
• Commensurability (cannot connect different kinds of classifications)
• Flexibility (model changes over time and space)
31. What kind of questions do we want to answer?
• Which country published most about waste management and recycling in 2014?
• What happens when you look only at the top 10% most cited?
• What kind of international collaborations do we see?
• What about patents?
The problem:
• topics for different document types don’t match – different classification systems
• and they don’t correlate with EU policies
35. How is European knowledge distributed across regions?
• A composite indicator combining publications, patents and projects shows that:
  • the volume of knowledge production is highly concentrated in large metropolitan regions, e.g. Paris, London, Munich
  • some medium-sized regions are highly productive in terms of intensity (normalised by population), e.g. Eindhoven and Heidelberg
  • some smaller areas have high volume and intensity, e.g. Oxfordshire
  • Eastern Europe shows low volume and intensity, except major cities, but all have low intensity (except Ljubljana)
36. Technological vs scientific knowledge production in genomics
• Technological production is measured by patents
• Scientific production is measured by publications
• These 2 types show different geographical distributions: technological production is more concentrated in space
• In terms of volume, Paris is the biggest cluster for both types
• Within regions, production varies a lot: London is the biggest producer of both types, while Eindhoven is key in terms of technological knowledge (both for volume and intensity)
• These findings reflect the different structure of public and private knowledge
38. The Semantic Approach
[Diagram: example documents of different types (a publication: "Perspectives on CO2 capture and storage", Filipp Johnsson, published 14-04-11; an H2020 project under call SC5-20-2014: "Zero Emission Robot-Boat for Coastal and Inland Water Monitoring"; a patent: "LED module with gold bonding", classified under "Processes or apparatus specially adapted for the manufacture or treatment of semiconductors") linked via Policy, Ontology and Data to the question "What is the innovation performance of France on climate change compared with Germany?"]
In a nutshell:
• We need to know which topics each document is talking about (multi-class classification)
• But we have to connect these topics together coherently
39. Ontologies connect information
• Find more information about the topic
• Link related topics
• Link with other sources (Nature.com, skos, DBpedia…)
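One concrete way to "link with other sources" is to follow skos:broader links in DBpedia. A minimal sketch, assuming the SPARQLWrapper Python package (the chosen category is only an example; the slides do not prescribe this tooling):

# Query DBpedia for broader SKOS categories of a topic.
# Requires the SPARQLWrapper package and network access; purely illustrative.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?broader WHERE {
        <http://dbpedia.org/resource/Category:Energy_storage> skos:broader ?broader .
    } LIMIT 10
""")
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["broader"]["value"])   # URIs of broader categories for the topic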
40. From ontology to data
1. Create ontology of topics representing KET and SGC
   • From existing classifications, policy documents, expert users, and data
2. Automatically generate collections of keywords
   • NLP techniques (term extraction, word embeddings) from a large training dataset
   • Ranking and scoring algorithms (sketched below) to decide:
     • Which topic(s) to match the keywords to?
     • Which are the best keywords?
     • Which are the best keyword combinations?
3. For each document, decide which topics best fit it
   • based on keywords and scoring algorithms
Example keywords: energy storage, storage of energy, accumulator, hydraulic accumulator, capacitor
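A minimal sketch of the ranking idea in step 2 (the three-dimensional vectors are toy values standing in for real word embeddings; the real system combines term extraction, embeddings and combination scoring):

# Rank candidate keywords for a topic by cosine similarity to the topic's seed term.
# The 3-dimensional vectors are toy values standing in for real word embeddings.
import numpy as np

VECTORS = {
    "energy storage":        np.array([0.9, 0.1, 0.0]),
    "hydraulic accumulator": np.array([0.8, 0.3, 0.1]),
    "capacitor":             np.array([0.7, 0.2, 0.2]),
    "wind turbine":          np.array([0.2, 0.9, 0.1]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_keywords(seed, candidates):
    """Score each candidate keyword by similarity to the topic's seed term."""
    return sorted(((cosine(VECTORS[seed], VECTORS[c]), c) for c in candidates),
                  reverse=True)

print(rank_keywords("energy storage",
                    ["hydraulic accumulator", "capacitor", "wind turbine"]))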
41. Creating and populating the ontology
1. Create ontology structure (classes & subclasses)
2. Add extra information (descriptions, links, alternate class names)
3. Ontology population: generate lists of terms associated with each class
45. Ontology population
Sustainable development of urban areas is a challenge of key importance. It requires new, efficient, and user-friendly technologies and services, in particular in the areas of energy, transport and ICT. However, these solutions need integrated approaches, both in terms of research and development of advanced technological solutions, as well as deployment. The focus on smart cities technologies will result in commercial-scale solutions with a high market potential.
1. Automatically generate keywords from class names, descriptions, and related information (e.g. DBpedia, skos, etc.) using term recognition tools
2. Enrich using word embeddings
3. Score the keywords according to how representative they are of that class
4. Generate prior probabilities using PMI for term combinations, based on frequency of co-occurrence
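Step 4 uses pointwise mutual information over term co-occurrence. A minimal worked sketch, with made-up counts for illustration:

# Pointwise mutual information for a term pair, from toy co-occurrence counts:
# PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )
import math

N = 10_000          # total documents in the (hypothetical) training set
count_x = 200       # documents containing "smart"
count_y = 300       # documents containing "cities"
count_xy = 150      # documents containing both

p_x, p_y, p_xy = count_x / N, count_y / N, count_xy / N
pmi = math.log2(p_xy / (p_x * p_y))
print(f"PMI(smart, cities) = {pmi:.2f}")   # high PMI: the combination is informative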
46. Annotating Data with Ontologies
• Data sources are annotated against the ontologies
  • each document is associated with one or more topics
• Sophisticated NLP matching and scoring of terms in the documents against the ontology
• A REST service accepts documents, scores and classifies them according to the ontology, and returns classification and keyword information (a client sketch follows below)
• Several million documents can be processed in about a week (using around 12 threads)
• Annotated data sources are then used to build indicators
  • e.g. for each topic, how many publications and in which region?
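A minimal client sketch for such a service (the endpoint URL and the response fields are hypothetical assumptions; only the accept-documents / return-classification behaviour comes from the slide):

# Hypothetical client for a document-classification REST service.
# The endpoint and response fields are illustrative assumptions, not the real API.
import requests

def classify(text):
    resp = requests.post("https://example.org/annotate",   # hypothetical endpoint
                         json={"text": text}, timeout=30)
    resp.raise_for_status()
    return resp.json()   # assumed shape: {"topics": [...], "keywords": [...]}

result = classify("Zero emission robot-boat for coastal and inland water monitoring")
print(result.get("topics"), result.get("keywords"))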
48. Example of patent annotation
Protein stabilized pharmacologically active agents, methods for the preparation thereof and methods for the use thereof
In accordance with the present invention, there are provided compositions and methods useful for the in vivo delivery of substantially water-insoluble pharmacologically active agents (such as the anti-cancer drug paclitaxel) in which the pharmacologically active agent is delivered in the form of suspended particles coated with protein (which acts as a stabilizing agent)…..
• RNA vaccines: (agent, protein, vaccine)
• anti-viral agents: (protein, anti-cancer, drug)
• protein vaccines: (protein, vaccine, antimicrobial)
KET: Industrial biotechnology
SGC: Health
49. Ongoing Challenges
Inconsistencies
• ontology design has to be tailored to user needs, but these are not uniform
Automation
• keyword-based approach still requires some manual intervention for best results
Accuracy
• language processing is never 100% accurate
Evaluation
• how do we know if/when it’s good enough?
• Determine weighting mechanisms; cut-off thresholds…
The future?
• integration of existing classification and modelling approaches with our semantics
50. GATE MIMIR and Prospector
• Built on top of annotated text in GATE
• can be used to index and search over text, annotations, semantic metadata
• allow queries that arbitrarily mix full-text, structural, linguistic and semantic annotations
• open source
• you can issue ridiculously complex queries
51. {
{Person gender=male sparql = "SELECT ?inst WHERE {
?inst :party
<http://dbpedia.org/resource/Labour_Party_%28UK%29> .
?inst :almaMater
<http://dbpedia.org/resource/University_of_Edinburgh> }"
}
[0..3] root:say
[0..5] {Money currency=“£” value > 2000000}
}
IN
{Document date > 20050000 topic=uk_health}
58. Social media and politics
• How do people in different parts of the country talk about elections and political events?
• How do the MPs talk about different topics?
• How does the public respond to them?
• What kind of people send hate speech, who to, and what about?
59. Hate speech against politicians Jan-Feb 2019
Annotation of tweets with simple DBpedia and NUTS linking lets us index and query our data in interesting ways
60. Who gets the abuse?
[Charts: 2015 and 2017]
64. Local and national news coverage of Brexit
• Green/blue indicates areas where local coverage outweighed national coverage
• Across the regions, Twitter remainers showed closer congruence with local press than Twitter leavers
65. So where are we at?
• We have lots of cool technology and data
• It doesn’t always need to be complex, it just needs to be applied
• There are a number of areas where semantic technologies are still really unknown
• New emerging domains of interest: journalism, politics, law, humanities (literature, geography, history…)
• Don’t neglect interdisciplinary “borrowing”
• The usual disclaimers about accuracy – real-world evaluation
66. Acknowledgements
This work was supported by the European Union under the Information and Communication Technologies (ICT) theme of the 7th Framework and H2020 Programmes for R&D:
• KNOWMAK (726992) http://knowmak.eu
• RISIS2 (82409) http://risis2.eu
• SoBigData (654024) http://www.sobigdata.eu
• WeVerify (825297) https://weverify.eu/
• COMRADES (687847) http://www.comrades-project.eu
• EPSRC grant EP/I004327/1 and the British Academy under the call “The Humanities and Social Sciences: Tackling the UK’s International Challenges”
• Nesta http://nesta.org.uk
• Free Press Unlimited pilot project: “Developing a database for the improved collection and systematisation of information on incidents of violations against journalists” https://www.freepressunlimited.org
And all the GATE team in Sheffield!