NLP/NIF
                 Knowledge and Media 2012-2013
                 Lecture 11
Monday, December 3, 12
Overview

                  Natural Language Processing 101

                  The NLP pipeline

                  NLP tasks

                  NLP Challenges

                  NIF (NLP Interchange Format)



NLP: What is it?
                  NLP or text analytics adds semantic understanding of:

                         named entities: people, companies, locations, etc.

                         pattern-based entities: email-addresses, phone numbers

                         concepts: abstractions of entities

                         facts and relationships

                         concrete and abstract attributes (e.g., 5 years, expensive)

                         subjectivity in the form of opinions, sentiments and
                         emotions

                                    SLIDE INSPIRATION: HTTP://WWW.SLIDESHARE.NET/SETHGRIMES/TEXT-ANALYTICS-OVERVIEW-2011
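Pattern-based entities are the easiest of these to illustrate, because a regular expression is often enough. The sketch below is only an illustration: both patterns, the function name and the sample sentence are assumptions, and real email and phone formats are far more varied than these patterns allow.

```python
import re

# Illustrative patterns only; real-world email and phone formats vary widely.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def find_pattern_entities(text):
    """Return the email addresses and phone numbers matched in text."""
    return {
        "email": EMAIL_RE.findall(text),
        "phone": PHONE_RE.findall(text),
    }


# Made-up sample sentence.
sample = "Contact jane.doe@example.org or call +31 20 555 0123."
print(find_pattern_entities(sample))
# → {'email': ['jane.doe@example.org'], 'phone': ['+31 20 555 0123']}
```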

80% of the information relevant to businesses is in
                  ‘unstructured’ textual form:

                         web pages, news and blog articles, forum postings,
                         other social media

                         email and messages

                         surveys, feedback forms, warranty claims

                         scientific literature, books, legal documents, patents

                         ...


                                SLIDE INSPIRATION: HTTP://WWW.SLIDESHARE.NET/SETHGRIMES/TEXT-ANALYTICS-OVERVIEW-2011

NLP: What is it for?
                  NLP transforms unstructured text into structured
                  information which may be:

                         categorised

                         queried

                         mined for patterns, topics or themes

                         presented intelligently

                         visualised and explored

                                SLIDE INSPIRATION: HTTP://WWW.SLIDESHARE.NET/SETHGRIMES/TEXT-ANALYTICS-OVERVIEW-2011

NLP: Some history

                  1950 - 1980: Handwritten rules

                         Russian - English translation system

                         ELIZA

                  Since 1980: Machine learning

                         IBM’s Watson



NLP: Tasks




                              IMAGE SOURCE: HTTP://NLTK.ORG/IMAGES/DIALOGUE.PNG

Morphological/Lexical Analysis

                  Language identification

                  Tokenisation

                  Stemming/Lemmatisation
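
Tokenisation can be sketched in a few lines. The version below is a deliberately naive assumption (one regular expression, a hypothetical function name), not how any particular toolkit tokenises: it keeps runs of word characters together and treats every other non-space character as its own token, so contractions such as "don't" are split into three tokens.

```python
import re


def tokenise(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


print(tokenise("Meg Whitman, CEO of eBay."))
# → ['Meg', 'Whitman', ',', 'CEO', 'of', 'eBay', '.']
```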




Syntactic Analysis

                  Text segmentation

                  Part of Speech (POS) tagging

                  Chunking

                  Shallow Parsing




Semantic Analysis

                  Named entity recognition (NER)

                  Relation finding

                  Semantic role labelling (SRL)

                  Word-sense disambiguation (WSD)

                  Co-reference/anaphora resolution



Semantic Analysis (ctd)

                  Topic detection/segmentation

                  Machine Translation (MT)

                  Sentiment analysis/opinion mining

                  Automatic summarisation




NLP: Approaches


                  Rule-based

                  Statistical

                  Hybrid methods




Named Entity Recognition Explained




NER: State-of-the-Art

                  Statistical methods: Conditional Random Fields (CRF)

                  Precision: 92.15%

                  Recall: 92.39%

                  F-Measure: 92.27%




Precision
                     How many predictions were correct?

                     P=TP/(TP+FP)

                                                         ACTUAL
                                                Spam                  Not Spam
                     PREDICTED   Spam       True Positive (TP)    False Positive (FP)
                                 Not Spam   False Negative (FN)   True Negative (TN)

Recall
                     Of the total number of instances in a class, how many
                     were found?

                     R=TP/(TP+FN)

                                                         ACTUAL
                                                Spam                  Not Spam
                     PREDICTED   Spam       True Positive (TP)    False Positive (FP)
                                 Not Spam   False Negative (FN)   True Negative (TN)

F-Score
                     Harmonic mean of Precision and Recall

                     F=2 • P • R/(P+R)

                     [Acc=(TP+TN)/(TP+FP+FN+TN)]

                                                         ACTUAL
                                                Spam                  Not Spam
                     PREDICTED   Spam       True Positive (TP)    False Positive (FP)
                                 Not Spam   False Negative (FN)   True Negative (TN)

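
The four measures can be computed directly from the confusion-matrix counts. The sketch below follows the formulas on these slides; the function names and the spam-filter counts are made up for illustration. The last line checks that the NER F-measure quoted earlier (92.27%) is indeed the harmonic mean of the quoted precision (92.15%) and recall (92.39%).

```python
def precision(tp, fp):
    """P = TP / (TP + FP): how many predictions were correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """R = TP / (TP + FN): how many instances of the class were found."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_score(p, r):
    """F = 2PR / (P + R): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def accuracy(tp, fp, fn, tn):
    """Acc = (TP + TN) / (TP + FP + FN + TN)."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical spam-filter counts, not taken from the lecture.
tp, fp, fn, tn = 40, 10, 20, 30
p, r = precision(tp, fp), recall(tp, fn)
print(round(p, 3), round(r, 3), round(f_score(p, r), 3), accuracy(tp, fp, fn, tn))
# → 0.8 0.667 0.727 0.7

# Consistency check on the NER figures quoted earlier (values in percent).
print(round(f_score(92.15, 92.39), 2))
# → 92.27
```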
Machine Learning 101
            Training

                  1. Collect a set of representative training documents

                  2. Label each token for its entity class or other (O)

                  3. Design feature extractors appropriate to the text and classes

                  4. Train a sequence classifier to predict the labels from the data

            Testing

                  1. Receive a set of testing documents

                  2. Run sequence model inference to label each token

                  3. Appropriately output the recognised entities


                         SLIDE FROM: HTTP://WWW.STANFORD.EDU/CLASS/CS124/LEC/INFORMATION_EXTRACTION_AND_NAMED_ENTITY_RECOGNITION.PDF
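
The training and testing steps above can be sketched end to end with a toy learner. Note the substitution: instead of a sequence classifier, this sketch uses a memory-based k-nearest-neighbour classifier that labels each token independently, which is much simpler and is not the method on the slide. The three features, the function names and the two training sentences are all assumptions made for illustration.

```python
from collections import Counter


def extract(tokens, i):
    """Features for token i: capitalised?, previous word, next word."""
    prev_word = tokens[i - 1].lower() if i > 0 else "<S>"
    next_word = tokens[i + 1].lower() if i < len(tokens) - 1 else "</S>"
    return (tokens[i][:1].isupper(), prev_word, next_word)


def train(labelled_sentences):
    """Training, reduced to storing one (features, label) pair per token."""
    return [
        (extract(tokens, i), label)
        for tokens, labels in labelled_sentences
        for i, label in enumerate(labels)
    ]


def distance(a, b):
    """Number of feature positions on which two feature tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def predict(memory, tokens, k=1):
    """Testing: label each token by majority vote of its k nearest examples."""
    labels = []
    for i in range(len(tokens)):
        feats = extract(tokens, i)
        nearest = sorted(memory, key=lambda ex: distance(feats, ex[0]))[:k]
        votes = Counter(label for _, label in nearest)
        labels.append(votes.most_common(1)[0][0])
    return labels


# Hypothetical, hand-labelled training data (O marks non-entities).
training = [
    (["Meg", "Whitman", "joined", "eBay"], ["I-PER", "I-PER", "O", "I-ORG"]),
    (["Tim", "Cook", "leads", "Apple"], ["I-PER", "I-PER", "O", "I-ORG"]),
]
memory = train(training)
print(predict(memory, ["Sue", "Jones", "joined", "Apple"]))
# → ['I-PER', 'I-PER', 'O', 'I-ORG']
```

With only two training sentences the result depends heavily on the context-word features; a realistic setup needs far more labelled data and a model that takes neighbouring labels into account.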

k-NN




                         HTTP://WWW.YOUTUBE.COM/USER/ANTALVANDENBOSCH#P/U/2/PB4QATZITLQ
NER Training Data
                  IOB Scheme

                  Inside, Outside, Begin

                  For each type of entity there is an I-XXX and a B-XXX tag

                  Non-entities are tagged O

                  B-XXX is only used when two entities of the same type are adjacent

                  Assumes that named entities are non-recursive and don’t overlap

                  Example:

                         Meg     Whitman   CEO   of   eBay
                         I-PER   I-PER     O     O    I-ORG


                                           SLIDE FROM: HTTP://WWW.INF.ED.AC.UK/TEACHING/COURSES/EMNLP/SLIDES/EMNLP07.PDF
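
Reading entities back out of an IOB-tagged sentence can be sketched as follows, assuming the variant described above in which an entity normally starts with I-XXX and B-XXX only marks a new entity directly after another one of the same type. The function name and the second example are assumptions.

```python
def iob_to_entities(tokens, tags):
    """Group tokens into (entity_type, text) pairs under the IOB scheme."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        # An open entity continues only on an I- tag of the same type.
        continues = prefix == "I" and etype == current_type
        if not continues:
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
            if prefix in ("I", "B"):
                current_type = etype
        if current_type is not None:
            current_tokens.append(token)
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


tokens = ["Meg", "Whitman", "CEO", "of", "eBay"]
tags = ["I-PER", "I-PER", "O", "O", "I-ORG"]
print(iob_to_entities(tokens, tags))
# → [('PER', 'Meg Whitman'), ('ORG', 'eBay')]

# B-XXX separates two adjacent entities of the same type (made-up example).
print(iob_to_entities(["Amsterdam", "Rotterdam"], ["I-LOC", "B-LOC"]))
# → [('LOC', 'Amsterdam'), ('LOC', 'Rotterdam')]
```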

Features for text learning task
                  Is the word capitalised?

                  Is the word at the start of a sentence?

                  What is the part-of-speech tag?

                  Previous and following words

                  Info from gazetteers



                  Useful features help your learner; badly chosen features may
                  harm it
                                   SLIDE BASED ON: HTTP://WWW.INF.ED.AC.UK/TEACHING/COURSES/EMNLP/SLIDES/EMNLP07.PDF
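
A feature extractor for the questions above might look like the sketch below. It assumes the token list is a single sentence, that part-of-speech tags come from some external tagger (none is run here), and that the gazetteer is a plain set of lower-cased names; the function name, feature names and sample data are all hypothetical.

```python
def token_features(tokens, i, pos_tags=None, gazetteer=frozenset()):
    """Return a feature dictionary for the token at position i."""
    token = tokens[i]
    return {
        "is_capitalised": token[:1].isupper(),
        "is_sentence_start": i == 0,
        # POS tags are optional input; None when no tagger output is given.
        "pos": pos_tags[i] if pos_tags else None,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<S>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</S>",
        "in_gazetteer": token.lower() in gazetteer,
    }


sentence = ["Meg", "Whitman", "CEO", "of", "eBay"]
companies = {"ebay"}  # made-up, one-entry gazetteer
print(token_features(sentence, 4, gazetteer=companies))
```

For "eBay" the capitalisation feature is False while the gazetteer feature is True, a small example of why no single feature is reliable on its own.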

Relation Finding Explained
                 [Figure labels: Amphibia, Anura]




Relation Finding: State-of-the-Art

                  Induce relation-dictionaries using slot filling (AutoSlog)

                  Example-based learning (Snowball)

                  Pattern-recognition over shallow parses (LEILA)




Relation Finding: pattern finding over shallow parses

                  direction   relation candidate                frequency   rating
                              is a municipality and a town in   45          +
                              is a municipality and a city in   19          +
                              is a municipality in              10          +
                              is one of the five districts of   5           -
                              is the name of two provinces in   5           -
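
The frequency column in such a table can be produced by collecting the text that connects two entities and counting how often each connecting phrase occurs. The sketch below does this on raw strings, which is a simplification: the approach on the slide works over shallow parses. The function names, place names and sentences are invented for illustration.

```python
from collections import Counter


def pattern_between(sentence, subj, obj):
    """Return the text between subj and obj, or None if not found in that order."""
    start = sentence.find(subj)
    if start < 0:
        return None
    end = sentence.find(obj, start + len(subj))
    if end < 0:
        return None
    return sentence[start + len(subj):end].strip()


def count_patterns(examples):
    """Count connecting phrases over (sentence, subject, object) triples."""
    counts = Counter()
    for sentence, subj, obj in examples:
        pattern = pattern_between(sentence, subj, obj)
        if pattern:
            counts[pattern] += 1
    return counts


# Made-up sentences with fictional place names.
examples = [
    ("Alphaville is a municipality and a town in Exampleland.", "Alphaville", "Exampleland"),
    ("Betaville is a municipality and a town in Exampleland.", "Betaville", "Exampleland"),
    ("Gammastad is a municipality and a city in Exampleland.", "Gammastad", "Exampleland"),
]
print(count_patterns(examples).most_common())
# → [('is a municipality and a town in', 2), ('is a municipality and a city in', 1)]
```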

RL for domain modelling

                  [Figure: graph of relations found between domain concepts.
                  Nodes: Class, Order, Family, Genus, Species, Type, Type Name,
                  Town, Province, Country, Location.
                  Edge labels, with the values shown in the figure:
                  is a (1.000, 1.000, 1.000, 0.854, 0.833, 0.750);
                  is found in (0.635, 0.573, 0.566); occur in (0.750, 0.333);
                  is a town in (0.794, 0.759); is a municipality in (0.891);
                  on the island of (0.500); is in (0.500);
                  may refer to (0.560, 0.482).]


RL for template filling
                  Date         Ship              Type           Crew   Ransom

                  2005/04/10   Feisty Gas        LNG carrier    12     $315,000
                  2005/06/27   Semlow            Freighter      10     $50,000
                  2005/10/28   Panagia           Bulk Carrier   22     $700,000
                  2005/11/05   Seabourn Spirit   Cruise ship    210    none


Opinion Mining Explained




Opinion Mining: State-of-the-Art
                  Supervised learning using features such as:

                         opinion words and phrases

                         negation

                         part-of-speech-tags

                         dependency parsing
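
Two of these features, opinion words and negation, can be sketched as a small feature extractor whose counts would then be fed to a supervised learner (no learner is shown here). The word lists are a tiny hypothetical lexicon built from words in the iPhone review used in these slides, and the negation rule, which only looks at the directly preceding word, is a deliberate simplification.

```python
# Tiny hypothetical lexicon; a real system would use a much larger one.
POSITIVE = {"nice", "cool", "clear", "long"}
NEGATIVE = {"mad", "expensive"}
NEGATORS = {"not", "no", "never"}


def opinion_features(tokens):
    """Count positive and negative opinion words, flipping polarity after a negator."""
    counts = {"positive": 0, "negative": 0}
    words = [t.lower() for t in tokens]
    for i, word in enumerate(words):
        if word in POSITIVE:
            polarity = "positive"
        elif word in NEGATIVE:
            polarity = "negative"
        else:
            continue
        # Flip polarity when the directly preceding word is a negator.
        if i > 0 and words[i - 1] in NEGATORS:
            polarity = "negative" if polarity == "positive" else "positive"
        counts[polarity] += 1
    return counts


print(opinion_features("The voice quality was clear too".split()))
# → {'positive': 1, 'negative': 0}
print(opinion_features("the battery life was not long".split()))
# → {'positive': 0, 'negative': 1}
```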



Positive or negative?

                  “I bought an iPhone a few days ago. It was such a
                  nice phone. The touch screen was really cool. The
                  voice quality was clear too. Although the battery
                  life was not long, that is ok for me. However, my
                  mother was mad with me as I did not tell her
                  before I bought it. She also thought the phone was
                  too expensive, and wanted me to return it to the
                  shop. … ”

                                               EXAMPLE FROM: BING LIU (2010) SENTIMENT ANALYSIS AND SUBJECTIVITY,
                                               IN: NLP HANDBOOK, 2ND EDITION, N. INDURKHYA AND F. J. DAMERAU (EDS),
                                               2010.


IBM’s Watson




                         HTTP://WWW.YOUTUBE.COM/WATCH?V=DYWO4ZKSFXW
NLP: Challenges

                  Negation

                  Messy text (twitter and SMS language)

                  Domain adaptation

                  Cross- and multi-document text analysis

                  Resource-scarce languages



NIF: Natural Language Processing Interchange Format




Look familiar?




NIF: Why do we need it?
                  Integration of NLP tools

                  Bridge between LOD and NLP communities




NIF Claims
            1.    NIF provides global interoperability. If an NLP tool incorporates a NIF parser and a NIF serializer, it is
                  compatible with all other tools that implement NIF.

            2.    NIF achieves this interoperability by defining and using a lowest common denominator for annotations.
                  This means that some standard annotations are required. On the other hand, NIF is flexible and
                  allows NLP tools to add any extra annotations at will.

            3.    NIF allows tool chains to be created without a large amount of up-front development work. As the output of each
                  tool is compatible, you can quickly test whether the tools you selected actually produce what you
                  need to solve a certain task.

            4.    As NIF is based on RDF/OWL, you can choose from a broad range of tools and technologies to work with it:

                         RDF makes data integration easy: URIs, LinkedData

                         OWL is based on Description Logics (Types, Type inheritance)

                         Availability of open data sets (access and licence)

                         Reusability of Vocabularies and Ontologies

                         Diverse serializations for annotations: XML, Turtle,
                         RDFa+XHTML

                         Scalable tool support (Databases, Reasoning)

                         Data is flexible and can be queried / transformed in many ways


Structural interoperability
                  NIF specifies how to create an identifier for
                  uniquely locating arbitrary substrings in a
                  document

                         using either offset-based or context-hash-based
                         URIs

                  String ontology to describe Strings

                  Structured Sentence Ontology
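
To make the two identifier styles concrete, the sketch below builds one URI of each kind for a substring of a document. Everything specific here is an assumption made for illustration: the `#offset_…` and `#hash_…` fragment layouts, the use of MD5 over the substring plus ten characters of context on either side, the function names and the document URI. The NIF specification is the authority on the actual URI recipes.

```python
import hashlib
from urllib.parse import quote


def offset_uri(doc_uri, text, begin, end):
    """Illustrative offset-based identifier for text[begin:end]."""
    label = quote(text[begin:end][:20])  # short human-readable suffix
    return f"{doc_uri}#offset_{begin}_{end}_{label}"


def context_hash_uri(doc_uri, text, begin, end, context=10):
    """Illustrative identifier built from the substring and its surrounding context."""
    left = text[max(0, begin - context):begin]
    right = text[end:end + context]
    message = f"{left}({text[begin:end]}){right}"
    digest = hashlib.md5(message.encode("utf-8")).hexdigest()
    label = quote(text[begin:end][:20])
    return f"{doc_uri}#hash_{context}_{end - begin}_{digest}_{label}"


text = "Meg Whitman is CEO of eBay"
doc = "http://example.org/doc"  # hypothetical document URI
print(offset_uri(doc, text, 0, 11))
# → http://example.org/doc#offset_0_11_Meg%20Whitman
print(context_hash_uri(doc, text, 0, 11))
```

In this sketch the offset-based URI depends only on character positions, while the context-hash URI depends on the substring and its immediate surroundings, so text appended beyond the context window leaves the hash-based URI unchanged.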



Conceptual Interoperability
                  Lemma and stem annotations are data type
                  properties in the Structured Sentence Ontology

                  POS tags use OLiA (Ontologies of Linguistic
                  Annotation)

                  NER tags use the vocabulary of the Semantic Content
                  Management System (SCMS) EU project




Access Interoperability
                  Main interface: wrapper to NIF Web service




                                     IMG: HTTP://NLP2RDF.ORG/FILES/2011/09/NIF_ARCHITECTURE.PNG

NLP/NIF: Wrap up

                  NLP History and tasks

                  Machine learning 101

                  Use cases: NER, relation finding and opinion
                  mining

                  Interoperability of NLP results with NIF




Further reading/Tools
                  Peter Jackson and Isabelle Moulinier (2007). Natural
                  Language Processing for Online Applications: Text
                  Retrieval, Extraction and Categorization. John
                  Benjamins. ISBN: 9027249938

                  ACL Anthology: A Digital Archive of Research
                  Papers in Computational Linguistics

                  Machine learning: WEKA

                  Natural language processing: GATE


Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 

Kürzlich hochgeladen (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 

KM Lecture11 nlp/nif

  • 1. NLP/NIF Knowledge and Media 2012-2013 Lecture 11 Monday, December 3, 12
  • 3. Overview: Natural Language Processing 101; the NLP pipeline; NLP tasks; NLP challenges; NIF (NLP Interchange Format).
  • 4. NLP: What is it? NLP or text analytics adds semantic understanding of: named entities (people, companies, locations, etc.); pattern-based entities (email addresses, phone numbers); concepts (abstractions of entities); facts and relationships; concrete and abstract attributes (e.g., 5 years, expensive); subjectivity in the form of opinions, sentiments and emotions. Slide inspiration: HTTP://WWW.SLIDESHARE.NET/SETHGRIMES/TEXT-ANALYTICS-OVERVIEW-2011
  • 5. 80% of relevant information to businesses is in ‘unstructured’ textual form: web pages, news and blog articles, forum postings, other social media; email and messages; surveys, feedback forms, warranty claims; scientific literature, books, legal documents, patents; ... Slide inspiration: HTTP://WWW.SLIDESHARE.NET/SETHGRIMES/TEXT-ANALYTICS-OVERVIEW-2011
  • 6. NLP: What is it for? NLP transforms unstructured text into structured information which may be: categorised; queried; mined for patterns, topics or themes; presented intelligently; visualised and explored. Slide inspiration: HTTP://WWW.SLIDESHARE.NET/SETHGRIMES/TEXT-ANALYTICS-OVERVIEW-2011
  • 7. NLP: Some history. 1950-1980: handwritten rules (Russian-English translation system, ELIZA). Since 1980: machine learning (IBM’s Watson).
  • 8. NLP: Tasks. Image source: HTTP://NLTK.ORG/IMAGES/DIALOGUE.PNG
  • 9. Morphological/Lexical Analysis: language identification; tokenisation; stemming/lemmatisation.
  • 10. Syntactic Analysis: text segmentation; part-of-speech (POS) tagging; chunking; shallow parsing.
  • 11. Semantic Analysis: named entity recognition (NER); relation finding; semantic role labelling (SRL); word-sense disambiguation (WSD); co-reference/anaphora resolution.
  • 12. Semantic Analysis (ctd): topic detection/segmentation; machine translation (MT); sentiment analysis/opinion mining; automatic summarisation.
  • 13. NLP: Approaches: rule-based; statistical; hybrid methods.
  • 14. Named Entity Recognition Explained
  • 15. NER: State-of-the-Art. Statistical methods: Conditional Random Fields (CRF). Precision: 92.15%; Recall: 92.39%; F-Measure: 92.27%.
  • 16. Precision: Of the instances predicted to belong to a class, how many were correct? P = TP/(TP+FP)
        Confusion matrix (rows: predicted class, columns: actual class):
                              Actual Spam            Actual Not Spam
        Predicted Spam        True Positive (TP)     False Positive (FP)
        Predicted Not Spam    False Negative (FN)    True Negative (TN)
  • 17. Recall: Of the total number of instances in a class, how many were found? R = TP/(TP+FN) (same confusion matrix as on slide 16)
  • 18. F-Score: Harmonic mean of Precision and Recall. F = 2 · P · R/(P+R). [Accuracy: Acc = (TP+TN)/(TP+FP+FN+TN)] (same confusion matrix as on slide 16)
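The three measures and the accuracy formula from slides 16 to 18 can be written out directly. The following is a minimal sketch; the function names and the confusion-matrix counts are made up for illustration and do not come from the lecture. The last line checks that the F-Measure reported on slide 15 is consistent with the reported precision and recall.

```python
def precision(tp, fp):
    """P = TP / (TP + FP); returns 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """R = TP / (TP + FN); returns 0.0 when the class has no instances."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_score(p, r):
    """Harmonic mean of precision and recall: F = 2 * P * R / (P + R)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def accuracy(tp, fp, fn, tn):
    """Acc = (TP + TN) / (TP + FP + FN + TN)."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical spam-filter counts, chosen only to illustrate the formulas.
tp, fp, fn, tn = 40, 10, 20, 30
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f_score(p, r), accuracy(tp, fp, fn, tn))

# Consistency check on the slide 15 figures (P = 92.15%, R = 92.39%).
print(round(f_score(0.9215, 0.9239) * 100, 2))  # 92.27
```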
  • 19. Machine Learning 101
        Training:
          1. Collect a set of representative training documents
          2. Label each token for its entity class or other (O)
          3. Design feature extractors appropriate to the text and classes
          4. Train a sequence classifier to predict the labels from the data
        Testing:
          1. Receive a set of testing documents
          2. Run sequence model inference to label each token
          3. Appropriately output the recognised entities
        Slide from: HTTP://WWW.STANFORD.EDU/CLASS/CS124/LEC/INFORMATION_EXTRACTION_AND_NAMED_ENTITY_RECOGNITION.PDF
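The training and testing steps can be sketched in a few lines. This is not the approach from the lecture: as a stand-in for a real sequence classifier such as a CRF, it uses a most-frequent-label-per-token baseline with the lowercased token as its only feature. The function names `train` and `predict` and the tiny labelled data set are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train(labelled_sentences):
    """Learn, for every lowercased token, the label it carries most often."""
    counts = defaultdict(Counter)
    for sentence in labelled_sentences:
        for token, label in sentence:
            counts[token.lower()][label] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


def predict(model, tokens):
    """Label each token; tokens never seen in training default to O."""
    return [model.get(token.lower(), "O") for token in tokens]


# Steps 1-2 of training: a hypothetical, hand-labelled training set.
training_data = [
    [("Meg", "PER"), ("Whitman", "PER"), ("CEO", "O"), ("of", "O"), ("eBay", "ORG")],
    [("eBay", "ORG"), ("hired", "O"), ("Whitman", "PER")],
]
model = train(training_data)

# Testing: label the tokens of an unseen sentence and output the result.
test_tokens = ["Whitman", "joined", "eBay"]
print(list(zip(test_tokens, predict(model, test_tokens))))
```

A baseline like this ignores context entirely, which is the gap that the feature extractors and the sequence model in the slide are meant to close.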
  • 20. k-NN: HTTP://WWW.YOUTUBE.COM/USER/ANTALVANDENBOSCH#P/U/2/PB4QATZITLQ
  • 21. NER Training Data: IOB Scheme (Inside, Outside, Begin). For each type of entity there is an I-XXX and a B-XXX tag; non-entities are tagged O; B-XXX is only used if two entities of the same type are next to each other; the scheme assumes that named entities are non-recursive and don’t overlap. Example:
        Meg     Whitman   CEO   of   eBay
        I-PER   I-PER     O     O    I-ORG
        Slide from: HTTP://WWW.INF.ED.AC.UK/TEACHING/COURSES/EMNLP/SLIDES/EMNLP07.PDF
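To make the tagging rules concrete, the sketch below groups IOB-tagged tokens back into entities, following the variant described on the slide, where B-XXX only appears between two adjacent entities of the same type. The function name `iob_to_entities` and the second example (two adjacent locations) are assumptions added for illustration; the first example is the one from the slide.

```python
def iob_to_entities(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type = None
    current_tokens = []

    def flush():
        # Close the entity currently being built, if there is one.
        nonlocal current_type, current_tokens
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag == "O":
            flush()
            continue
        prefix, entity_type = tag.split("-", 1)
        # A new entity starts on B-XXX, or on I-XXX when the type changes
        # or when no entity is currently open.
        if prefix == "B" or entity_type != current_type:
            flush()
            current_type = entity_type
        current_tokens.append(token)
    flush()
    return entities


# Example from the slide.
print(iob_to_entities(
    ["Meg", "Whitman", "CEO", "of", "eBay"],
    ["I-PER", "I-PER", "O", "O", "I-ORG"],
))  # [('PER', 'Meg Whitman'), ('ORG', 'eBay')]

# Hypothetical example: B-LOC separates two adjacent entities of the same type.
print(iob_to_entities(["Paris", "London"], ["I-LOC", "B-LOC"]))
```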
  • 22. Features for a text learning task: Is the word capitalised? Is the word at the start of a sentence? What is the part-of-speech tag? Previous and following words. Info from gazetteers. Useful features help your learner; badly chosen features may harm it. Slide based on: HTTP://WWW.INF.ED.AC.UK/TEACHING/COURSES/EMNLP/SLIDES/EMNLP07.PDF
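The features listed on the slide could be computed per token roughly as follows. This is a minimal sketch under several assumptions: the token list is a single sentence, the POS tags and the gazetteer are supplied by hand rather than by a tagger or a real lexicon, and the function name `token_features` and the feature names are made up for illustration.

```python
def token_features(tokens, pos_tags, i, gazetteer):
    """Build a feature dictionary for the token at position i of one sentence."""
    token = tokens[i]
    return {
        "is_capitalised": token[:1].isupper(),
        "is_sentence_start": i == 0,
        "pos": pos_tags[i],
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
        "in_gazetteer": token.lower() in gazetteer,
    }


tokens = ["Meg", "Whitman", "CEO", "of", "eBay"]
pos_tags = ["NNP", "NNP", "NN", "IN", "NNP"]  # hand-assigned, hypothetical tags
gazetteer = {"ebay"}                          # hypothetical list of known organisations

for i in range(len(tokens)):
    print(tokens[i], token_features(tokens, pos_tags, i, gazetteer))
```

Note that "eBay" is not flagged as capitalised, while the gazetteer feature still fires for it: a small illustration of why several complementary features are usually combined.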
  • 23. Relation Finding Explained (figure labels: Amphibia, Anura)
  • 24. Relation Finding: State-of-the-Art: induce relation dictionaries using slot filling (AutoSlog); example-based learning (Snowball); pattern recognition over shallow parses (LEILA).
  • 25. Relation Finding: pattern finding over shallow parses (the values of the table's direction column are not preserved in this transcript)
        relation candidate                  frequency   rating
        is a municipality and a town in     45          +
        is a municipality and a city in     19          +
        is a municipality in                10          +
        is one of the five districts of     5           -
        is the name of two provinces in     5           -
  • 26. RL for domain modelling (figure: graph of concepts such as Species, Order, Family, Genus, Class, Town, Location, Country and Province, linked by extracted relations such as "is a", "is a town in", "is a municipality in", "on the island of", "is in", "is found in", "occur in" and "may refer to", each with a score between 0.333 and 1.000)
  • 27. RL for template filling
        Date         Ship              Type           Crew   Ransom
        2005/04/10   Feisty Gas        LNG carrier    12     $315,000
        2005/06/27   Semlow            Freighter      10     $50,000
        2005/10/28   Panagia           Bulk Carrier   22     $700,000
        2005/11/05   Seabourn Spirit   Cruise ship    210    none
  • 28. Opinion Mining Explained
  • 29. Opinion Mining: State-of-the-Art: supervised learning using features such as: opinion words and phrases; negation; part-of-speech tags; dependency parsing.
  • 30. Positive or negative? “I bought an iPhone a few days ago. It was such a nice phone. The touch screen was really cool. The voice quality was clear too. Although the battery life was not long, that is ok for me. However, my mother was mad with me as I did not tell her before I bought it. She also thought the phone was too expensive, and wanted me to return it to the shop. … ” Example from: Bing Liu (2010) Sentiment Analysis and Subjectivity, in: NLP Handbook, 2nd edition, N. Indurkhya and F. J. Damerau (eds), 2010.
  • 31. IBM’s Watson: HTTP://WWW.YOUTUBE.COM/WATCH?V=DYWO4ZKSFXW
  • 32. NLP: Challenges: negation; messy text (Twitter and SMS language); domain adaptation; cross- and multi-document text analysis; resource-scarce languages.
  • 33. NIF: Natural Language Processing Interchange Format
  • 36. NIF: Why do we need it? Integration of NLP tools; bridge between LOD and NLP communities.
  • 37. NIF Claims
        1. NIF provides global interoperability. If an NLP tool incorporates a NIF parser and a NIF serializer, it is compatible with all other tools, which implement NIF.
        2. NIF achieves this interoperability by using and defining a most common denominator for annotations. This means that some standard annotations are required to be used. On the other hand NIF is flexible and allows the NLP tools to add any extra annotations at will.
        3. NIF allows to create tool chains without a large amount of up-front development work. As the output of each tool is compatible, you can try and test really fast, whether the tools you selected actually produce what you need to solve a certain task.
        4. As NIF is based on RDF/OWL, you can choose from a broad range of tools and technologies to work with it:
           - RDF makes data integration easy: URIs, LinkedData
           - OWL is based on Description Logics (Types, Type inheritance)
           - Availability of open data sets (access and licence)
           - Reusability of Vocabularies and Ontologies
           - Diverse serializations for annotations: XML, Turtle, RDFa+XHTML
           - Scalable tool support (Databases, Reasoning)
           - Data is flexible and can be queried / transformed in many ways
  • 38. Structural interoperability: NIF specifies how to create an identifier for uniquely locating arbitrary substrings in a document, using either offset-based or context-hash-based URIs; String Ontology to describe strings; Structured Sentence Ontology.
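The idea behind offset-based identifiers can be illustrated with a short sketch. The fragment syntax used here (`#char=begin,end`, in the style of RFC 5147) is an assumption for illustration; the exact URI recipe, and the context-hash variant, should be taken from the NIF specification. The document URI and the function names `offset_uri` and `resolve` are likewise hypothetical.

```python
def offset_uri(doc_uri, text, begin, end):
    """Mint an identifier for the substring text[begin:end] of a document."""
    if not 0 <= begin <= end <= len(text):
        raise ValueError("offsets must lie inside the document text")
    return f"{doc_uri}#char={begin},{end}"


def resolve(uri, text):
    """Recover the substring that an offset-based identifier points at."""
    fragment = uri.rsplit("#char=", 1)[1]
    begin, end = (int(part) for part in fragment.split(","))
    return text[begin:end]


text = "Meg Whitman CEO of eBay"
begin = text.index("eBay")
uri = offset_uri("http://example.org/doc1", text, begin, begin + len("eBay"))
print(uri)
print(resolve(uri, text))  # eBay
```

Offsets are compact but break as soon as the document text changes, which is one motivation for the context-hash-based alternative mentioned on the slide.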
  • 39. Conceptual Interoperability: lemma and stem annotations are datatype properties in the Structured Sentence Ontology; POS tags use OLiA (Ontologies of Linguistic Annotation); NER tags come from the Semantic Content Management System (SCMS) EU project.
  • 40. Access Interoperability: main interface: wrapper to NIF Web service. Image: HTTP://NLP2RDF.ORG/FILES/2011/09/NIF_ARCHITECTURE.PNG
  • 41. NLP/NIF: Wrap up: NLP history and tasks; machine learning 101; use cases: NER, relation finding and opinion mining; interoperability of NLP results with NIF.
  • 42. Further reading/Tools: Peter Jackson and Isabelle Moulinier (2007) Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. John Benjamins. ISBN: 9027249938; ACL Anthology: A Digital Archive of Research Papers in Computational Linguistics; Machine learning: WEKA; Natural language processing: GATE.