The document provides an introduction to natural language processing (NLP). It defines NLP as a field of artificial intelligence devoted to creating computers that can use natural language as input and output. Some key NLP applications mentioned include data analysis of user-generated content, conversational agents, translation, classification, information retrieval, and summarization. The document also discusses various linguistic levels of analysis like phonology, morphology, syntax, and semantics that involve ambiguity challenges. Common NLP tasks like part-of-speech tagging, named entity recognition, parsing, and information extraction are described. Finally, the document outlines the typical steps in an NLP pipeline including data collection, text cleaning, preprocessing, feature engineering, modeling and evaluation.
Introduction to natural language processing (NLP)
1. ML Applications: 1st Session
Introduction to Natural Language Processing (NLP)
Alia Hamwi
2. What is NLP?
• Natural Language Processing (NLP) is a field in Artificial Intelligence
(AI) devoted to creating computers that use natural language as input
and/or output.
3. What is NLP?
• The field of NLP involves making computers perform useful tasks
with the natural languages humans use. The input and output of an
NLP system can be:
• Speech
• Written Text
4. NLP Applications
• Data-mining and analytics of weblogs, microblogs, discussion forums,
user reviews, and other forms of user-generated media.
5. NLP Applications
• Conversational agents combine:
• Speech recognition/synthesis
• Question answering
• From the web and from structured information sources (Freebase, DBpedia, etc.)
• Commands identification for agent-like abilities
• Create/edit calendar entries
• Reminders
• Directions
• Invoking/interacting with other apps
7. DIRA (From English to Egyptian Dialect)
https://aclanthology.org/I13-2004.pdf
8. NLP Applications
• Classifiers: classify a set of documents into categories (e.g., email spam
filters)
• Information Retrieval: find relevant documents to a given query.
(search engines)
• Summarization: Produce a readable summary, e.g., news about oil
today.
• Spelling checkers, grammar checkers, auto-completion, and more
10. Linguistic Levels of Analysis / Ambiguity
• Syntax القواعدي
• Grammar: how word sequences are structured
• Part-of-speech (noun, verb, adjective, preposition, etc.)
• Phrase structure (e.g. noun phrase, verb phrase)
• Ambiguity
11. Linguistic Levels of Analysis
• Semantics الدلالي
• Meaning of a word
• Ambiguity (e.g., “board”, “book”, Arabic “عين”)
• Dialogue
• Meaning and inter-relations between sentences
12. Common NLP Tasks
• Word tokenization
• Sentence boundary detection
• Part-of-speech (POS) tagging
• to identify the part-of-speech (e.g. noun, verb) of each word
• Named Entity (NE) recognition
• to identify proper nouns (e.g. names of person, location, organization; domain
terminologies)
• Parsing
• to identify the syntactic structure of a sentence
• Semantic analysis
• to derive the meaning of a sentence
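The first two tasks above can be roughly approximated with regular expressions. The sketch below is an illustration only; real tokenizers handle abbreviations, clitics, and numbers far more carefully:

```python
import re

def sentence_split(text):
    """Naive sentence boundary detection: split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def word_tokenize(sentence):
    """Naive word tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "The lead paint is unsafe. Remove it!"
sentences = sentence_split(text)           # two sentences
tokens = [word_tokenize(s) for s in sentences]
```

A lookbehind split keeps the sentence-final punctuation attached to its sentence, which downstream tools usually expect.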
13. NLP Task : Part-Of-Speech (POS) Tagging
• POS tagging is a process of assigning a POS or lexical class marker to
each word in a sentence (and all sentences in a corpus).
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
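A toy dictionary-based tagger can reproduce this output. The lexicon below is invented for illustration; a real tagger uses context to resolve ambiguous words such as “lead” (noun vs. verb):

```python
# Toy lexicon mapping each word to one tag (invented for illustration).
LEXICON = {
    "the": "Det", "lead": "N", "paint": "N",
    "is": "V", "unsafe": "Adj",
}

def pos_tag(tokens):
    """Tag each token by lexicon lookup, defaulting to 'N' for unknown words."""
    return [(tok, LEXICON.get(tok.lower(), "N")) for tok in tokens]

tagged = pos_tag("the lead paint is unsafe".split())
# tagged == [("the", "Det"), ("lead", "N"), ("paint", "N"), ("is", "V"), ("unsafe", "Adj")]
```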
14. NLP Task : Named Entity Recognition (NER)
• NER is to process a text and identify named entities in a sentence
• e.g. “U.N. official Ekeus heads for Baghdad.”
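A minimal gazetteer lookup illustrates the idea on this sentence. The entity lists are hand-written for the example; production NER systems rely on context and statistical models rather than fixed lists:

```python
# Tiny gazetteer mapping known names to entity types (invented for illustration).
GAZETTEER = {
    "U.N.": "ORG",
    "Ekeus": "PER",
    "Baghdad": "LOC",
}

def ner(tokens):
    """Label tokens found in the gazetteer; everything else gets 'O' (outside)."""
    return [(tok, GAZETTEER.get(tok, "O")) for tok in tokens]

result = ner("U.N. official Ekeus heads for Baghdad .".split())
```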
16. NLP Task : Parsing and dependency parsing
• Shallow (or partial) parsing identifies the (base) syntactic phrases in a
sentence.
• After NEs are identified, dependency parsing is often applied to
extract the syntactic/dependency relations between the NEs.
[NP He] [V saw] [NP the big dog]
[PER Bill Gates] founded [ORG Microsoft].
Dependency relations:
nsubj(founded, Bill Gates)
dobj(founded, Microsoft)
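Once dependency relations are available, subject–verb–object triples can be read directly off them. This sketch assumes relations arrive as (relation, head, dependent) tuples, which is one common convention but not the only one:

```python
def extract_svo(relations):
    """Group nsubj/dobj relations by their shared head verb into (subj, verb, obj) triples."""
    subjects, objects = {}, {}
    for rel, head, dep in relations:
        if rel == "nsubj":
            subjects[head] = dep
        elif rel == "dobj":
            objects[head] = dep
    # A triple exists only when a verb has both a subject and an object.
    return [(subjects[v], v, objects[v]) for v in subjects if v in objects]

relations = [("nsubj", "founded", "Bill Gates"),
             ("dobj", "founded", "Microsoft")]
triples = extract_svo(relations)
# triples == [("Bill Gates", "founded", "Microsoft")]
```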
17. NLP Task : Information Extraction
• Identify specific pieces of information (data) in an unstructured or
semi-structured text
• Transform unstructured information in a corpus of texts or web
pages into a structured database (or templates)
• Applied to various types of text, e.g.
• Newspaper articles
• Scientific articles
• Web pages
• etc.
18. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new
Taiwan dollars, will start production in January 1990 with production of 20,000
iron and “metal wood” clubs a month.
Template filling (ACTIVITY-1):
• Activity: PRODUCTION
• Company: “Bridgestone Sports Taiwan Co.”
• Product: “iron and ‘metal wood’ clubs”
• Start Date: DURING January 1990
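A crude form of template filling can be sketched with regular expressions. The patterns below are hand-written for this one sentence and would not generalize; real extraction systems learn or engineer far more robust patterns:

```python
import re

TEXT = ("The joint venture, Bridgestone Sports Taiwan Co., capitalized at "
        "20 million new Taiwan dollars, will start production in January 1990 "
        "with production of 20,000 iron and 'metal wood' clubs a month.")

def fill_template(text):
    """Fill a PRODUCTION template using hand-written patterns (illustration only)."""
    template = {"Activity": "PRODUCTION"}
    company = re.search(r"joint venture, ([^,]+),", text)
    start = re.search(r"start production in (\w+ \d{4})", text)
    if company:
        template["Company"] = company.group(1)
    if start:
        template["Start Date"] = start.group(1)
    return template

result = fill_template(TEXT)
```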
20. NLP Pipeline: Data Collection
• Ideal Setting: We have everything needed.
• Labels and Annotations
• Very often we are dealing with less-than-ideal scenarios
• Initial datasets with limited annotations/labels
• Initial datasets labeled based on regular expressions or heuristics
• Public datasets (cf. Google Dataset Search or kaggle)
• Scrape data
21. NLP Pipeline: Text Cleaning
• Extracting raw texts from the input data
• HTML
• PDF
• Relevant vs. irrelevant information
• non-textual information
• markup
• metadata
• Encoding format
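Extracting raw text from HTML can be sketched with Python's standard-library parser, dropping markup and non-textual content such as scripts and stylesheets (a minimal sketch; libraries like BeautifulSoup handle malformed pages more gracefully):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text while skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

raw = "<html><head><style>p{color:red}</style></head><body><p>Hello, NLP!</p></body></html>"
text = html_to_text(raw)
```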
23. NLP Pipeline:
Feature Engineering/text representation
• Feature Engineering for Classical ML
• Bag-of-words representations
• Domain-specific word frequency lists
• Handcrafted features based on domain-specific knowledge
• Feature Engineering for DL
• DL directly takes the texts as inputs to the model.
• The DL model is capable of learning features from the texts (e.g.,
embeddings)
• The price is that the model is often less interpretable.
24. NLP Pipeline: Bag of Words Model (Binary)
• Bag-of-words model is the simplest way (i.e., easy to be automated)
to vectorize texts into binary representations.
25. NLP Pipeline: Bag of Words Model (Count)
• Bag-of-words model is the simplest way (i.e., easy to be automated)
to vectorize texts into numeric representations.
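Both variants of the bag-of-words model can be written in a few lines of plain Python. This sketch lowercases and splits on whitespace only; real pipelines tokenize and normalize more carefully:

```python
from collections import Counter

def build_vocab(docs):
    """Sorted vocabulary over all whitespace-split, lowercased tokens."""
    return sorted({w for d in docs for w in d.lower().split()})

def bow_vector(doc, vocab, binary=False):
    """Count vector (or binary presence vector) over a fixed vocabulary."""
    counts = Counter(doc.lower().split())
    if binary:
        return [1 if w in counts else 0 for w in vocab]
    return [counts[w] for w in vocab]

docs = ["the cat sat", "the cat sat on the mat"]
vocab = build_vocab(docs)                          # ['cat', 'mat', 'on', 'sat', 'the']
counts = bow_vector(docs[1], vocab)                # [1, 1, 1, 1, 2]
binary = bow_vector(docs[1], vocab, binary=True)   # [1, 1, 1, 1, 1]
```

Note that both vectors lose word order entirely, which motivates the issues listed on the next slide.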
26. NLP Pipeline: Bag of Words Model
• Issues with Bag-of-Words Text Representation
• Word order is ignored.
• Raw absolute frequency counts of words do not necessarily represent the
meaning of the text properly.
27. NLP Pipeline: TF-IDF Model
• TF-IDF model is an extension of the bag-of-words model, whose main
objective is to adjust the raw frequency counts by considering the
dispersion of the words in the corpus.
• Dispersion refers to how evenly each word/term is distributed across
different documents of the corpus.
• Interaction between Word Raw Frequency Counts and Dispersion:
• Given a high-frequency word:
• If the word is widely dispersed across different documents of the corpus (i.e., high dispersion)
• it is more likely to be semantically general.
• If the word is mostly centralized in a limited set of documents in the corpus (i.e., low
dispersion)
• it is more likely to be topic-specific.
• Dispersion rates of words can be used as weights for the importance of
word frequency counts.
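The interaction above can be sketched with one common TF-IDF formulation, tf = raw count and idf = log(N / df); many variants (smoothing, sublinear tf) exist, so treat this as one choice among several:

```python
import math

def tfidf_vectors(docs):
    """TF-IDF sketch: tf = raw count, idf = log(N / df).

    Widely dispersed words (high document frequency) get low idf,
    down-weighting semantically general terms as described above.
    """
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    df = {w: sum(1 for toks in tokenized if w in toks) for w in vocab}
    idf = {w: math.log(n / df[w]) for w in vocab}
    return vocab, [[toks.count(w) * idf[w] for w in vocab] for toks in tokenized]

docs = ["the cat sat", "the dog ran"]
vocab, vecs = tfidf_vectors(docs)
# "the" appears in every document (df = N), so its idf, and hence its weight, is 0.
```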
30. Further Resources..
• Deep Learning for NLP in Python – DataCamp
https://learn.datacamp.com/skill-tracks/deep-learning-for-nlp-in-python
• Natural Language Processing Specialization – Coursera
https://www.coursera.org/specializations/natural-language-processing
• Speech and Language Processing – Book
https://web.stanford.edu/~jurafsky/slp3/