ML Applications: 1st Session
Introduction to Natural Language Processing (NLP)
Alia Hamwi
What is NLP?
• Natural Language Processing (NLP) is a field in Artificial Intelligence
(AI) devoted to creating computers that use natural language as input
and/or output.
What is NLP?
• The field of NLP involves making computers perform useful tasks
with the natural languages humans use. The input and output of an
NLP system can be:
• Speech
• Written Text
NLP Applications
• Data-mining and analytics of weblogs, microblogs, discussion forums,
user reviews, and other forms of user-generated media.
NLP Applications
• Conversational agents combine:
• Speech recognition/synthesis
• Question answering
• From the web and from structured information sources (Freebase, DBpedia, etc.)
• Command identification for agent-like abilities
• Create/edit calendar entries
• Reminders
• Directions
• Invoking/interacting with other apps
NLP Applications
• Translation
- Google
- DIRA (from English to Egyptian dialect)
DIRA (from English to Egyptian dialect)
https://aclanthology.org/I13-2004.pdf
NLP Applications
• Classifiers: classify a set of documents into categories (e.g., email spam
filters).
• Information retrieval: find documents relevant to a given query
(e.g., search engines).
• Summarization: produce a readable summary, e.g., of today's news about oil.
• Spelling checkers, grammar checkers, auto-filling, and more.
Linguistics Levels of Analysis/Ambiguity
• Phonology (الصوتي)
• Maps the speech audio signal to phonemes (sounds, letters, pronunciation)
• Ambiguity (“two”, “too”; سائد, صائد; “I scream” / “Ice cream”)
• Morphology (الصرفي)
• The structure of words
• Inflection (e.g. “I”, “my”, “me”; “eat”, “eats”, “ate”, “eaten”)
• Derivation (e.g. “teach”, “teacher”; “كتب”, “كاتب”; “friendly”)
• Ambiguity (كوارث; “unionized”: union-ized vs. un-ionized)
Linguistics Levels of Analysis/Ambiguity
• Syntax (القواعدي)
• Grammar: how sequences of words are structured
• Part-of-speech (noun, verb, adjective, preposition, etc.)
• Phrase structure (e.g. noun phrase, verb phrase)
• Ambiguity
Linguistics Levels of Analysis
• Semantics (الدلالي)
• Meaning of a word
• Ambiguity (“board”, “book”, “عين”)
• Dialogue
• Meaning and inter-relation between sentences
Common NLP Tasks
• Word tokenization
• Sentence boundary detection
• Part-of-speech (POS) tagging
• to identify the part-of-speech (e.g. noun, verb) of each word
• Named Entity (NE) recognition
• to identify proper nouns (e.g. names of person, location, organization; domain
terminologies)
• Parsing
• to identify the syntactic structure of a sentence
• Semantic analysis
• to derive the meaning of a sentence
NLP Task : Part-Of-Speech (POS) Tagging
• POS tagging is a process of assigning a POS or lexical class marker to
each word in a sentence (and all sentences in a corpus).
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
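The input/output format above can be sketched with a toy lookup tagger. This is only an illustration: the `LEXICON` dictionary and the `tag` function are hypothetical names, and the lexicon is written by hand for this one sentence. Real taggers use context to resolve ambiguity (for instance, "lead" can be a noun or a verb), which a fixed lookup cannot do.

```python
# Toy lexicon-based POS tagger: each word is looked up in a fixed dictionary.
# LEXICON is a hypothetical, hand-written resource covering only this example.
LEXICON = {
    "the": "Det",
    "lead": "N",
    "paint": "N",
    "is": "V",
    "unsafe": "Adj",
}


def tag(sentence: str) -> str:
    """Return the sentence with each word annotated as word/TAG."""
    tagged = []
    for word in sentence.split():
        # Words missing from the lexicon get the placeholder tag "UNK".
        pos = LEXICON.get(word.lower(), "UNK")
        tagged.append(f"{word}/{pos}")
    return " ".join(tagged)


print(tag("the lead paint is unsafe"))
```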
NLP Task : Named Entity Recognition (NER)
• NER processes a text and identifies the named entities in each sentence
• e.g. “U.N. official Ekeus heads for Baghdad.”
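As a rough sketch of what NER output looks like for the example sentence, the snippet below uses a gazetteer (a fixed list of known names). The `GAZETTEER` dictionary, the `find_entities` function, and the `LOC` label are assumptions made for illustration; practical NER systems are usually statistical models that also handle names they have never seen.

```python
# Toy gazetteer-based NER: entities are found by exact lookup in a fixed list.
# GAZETTEER is a hypothetical, hand-written resource covering only this example.
GAZETTEER = {
    "U.N.": "ORG",
    "Ekeus": "PER",
    "Baghdad": "LOC",
}


def find_entities(sentence: str) -> list[tuple[str, str]]:
    """Return (entity, label) pairs in order of appearance."""
    entities = []
    for token in sentence.split():
        # Keep listed abbreviations such as "U.N." intact; otherwise drop
        # trailing punctuation so that "Baghdad." matches "Baghdad".
        candidate = token if token in GAZETTEER else token.rstrip(".,")
        if candidate in GAZETTEER:
            entities.append((candidate, GAZETTEER[candidate]))
    return entities


print(find_entities("U.N. official Ekeus heads for Baghdad."))
```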
NLP Task : Parsing and dependency parsing
• Shallow (or partial) parsing identifies the (base) syntactic phrases in a
sentence.
• After NEs are identified, dependency parsing is often applied to
extract the syntactic/dependency relations between the NEs.
Shallow parsing: [NP He] [V saw] [NP the big dog]
Dependency parsing: [PER Bill Gates] founded [ORG Microsoft].
Dependency relations, written as relation(head, dependent):
nsubj(founded, Bill Gates)
dobj(founded, Microsoft)
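One way to see how dependency relations link two named entities is to store them as triples and join the subject and object arcs that share a head. This is a minimal sketch under assumptions: the triples are typed in by hand using the order (relation, head, dependent) with the verb "founded" as head, and `subject_verb_object` is a hypothetical helper, not part of any parser.

```python
# Dependency relations stored as (relation, head, dependent) triples.
# The triples are written by hand here; a real parser would produce them.
relations = [
    ("nsubj", "founded", "Bill Gates"),
    ("dobj", "founded", "Microsoft"),
]


def subject_verb_object(triples):
    """Join nsubj and dobj arcs that share a head into (subject, verb, object)."""
    subjects = {head: dep for rel, head, dep in triples if rel == "nsubj"}
    objects = {head: dep for rel, head, dep in triples if rel == "dobj"}
    return [(subjects[h], h, objects[h]) for h in subjects if h in objects]


print(subject_verb_object(relations))
```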
NLP Task : Information Extraction
• Identify specific pieces of information (data) in an unstructured or
semi-structured text
• Transform unstructured information in a corpus of texts or web
pages into a structured database (or templates)
• Applied to various types of text, e.g.
• Newspaper articles
• Scientific articles
• Web pages
• etc.
The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new
Taiwan dollars, will start production in January 1990 with production of 20,000
iron and “metal wood” clubs a month.
Template filling:
ACTIVITY-1
Activity: PRODUCTION
Company: “Bridgestone Sports Taiwan Co.”
Product: “iron and ‘metal wood’ clubs”
Start Date: DURING: January 1990
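Template filling for this sentence can be sketched with hand-written regular expressions. The patterns in `PATTERNS` and the `fill_template` function are hypothetical and tailored to this single sentence; they are not a general extraction method, and real systems typically build on NER and parsing instead.

```python
import re

TEXT = (
    "The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million "
    "new Taiwan dollars, will start production in January 1990 with production "
    "of 20,000 iron and “metal wood” clubs a month."
)

# One hand-written pattern per template slot (illustrative only).
PATTERNS = {
    "Company": r"joint venture, ([^,]+?Co\.),",
    "Product": r"production of [\d,]+ (.+?) a month",
    "Start Date": r"start production in ([A-Z][a-z]+ \d{4})",
}


def fill_template(text: str) -> dict:
    """Fill each slot with the first regex match, or None if nothing matches."""
    template = {}
    if re.search(r"\bproduction\b", text):
        template["Activity"] = "PRODUCTION"
    for slot, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        template[slot] = match.group(1) if match else None
    return template


for slot, value in fill_template(TEXT).items():
    print(f"{slot}: {value}")
```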
NLP Pipeline
NLP Pipeline: Data Collection
• Ideal Setting: We have everything needed.
• Labels and Annotations
• Very often we are dealing with less-than-ideal scenarios
• Initial datasets with limited annotations/labels
• Initial datasets labeled based on regular expressions or heuristics
• Public datasets (cf. Google Dataset Search or Kaggle)
• Scraped data
NLP Pipeline: Text Cleaning
• Extracting raw texts from the input data
• HTML
• PDF
• Relevant vs. irrelevant information
• non-textual information
• markup
• metadata
• Encoding format
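Extracting raw text from HTML while dropping markup and non-textual content might look like the sketch below. It uses only Python's standard `html.parser` module; the `TextExtractor` class, the `extract_text` function, and the sample page are made up for illustration. Real pages usually need more care (navigation menus, encodings, malformed markup).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Oil prices <b>rose</b> today.</p><script>track()</script></body></html>"
)
print(extract_text(sample))
```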
NLP Pipeline: Preprocessing
• Sentence segmentation
• Word tokenization
• Frequent preprocessing
• Stopword removal
• Stemming and/or lemmatization
• Digit/punctuation removal
• Case normalization
• Arabic: remove diacritics
• Remove redundant spaces
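Several of the steps above can be chained into one small function, sketched here with the standard `re` module. The stopword list is a tiny hypothetical sample, the Arabic diacritic range U+064B to U+0652 covers the common tashkeel marks only, and tokenization is plain whitespace splitting. Sentence segmentation, stemming, and lemmatization are left out because they usually rely on language-specific resources.

```python
import re

# A tiny, hypothetical stopword list; real lists are much longer.
STOPWORDS = {"the", "is", "a", "an", "of", "and", "in"}

# Common Arabic diacritics (tashkeel): fathatan through sukun.
ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652]")


def preprocess(text: str) -> list[str]:
    text = ARABIC_DIACRITICS.sub("", text)    # remove Arabic diacritics
    text = text.lower()                       # case normalization
    text = re.sub(r"\d", " ", text)           # remove digits
    text = re.sub(r"[^\w\s]", " ", text)      # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()  # collapse redundant spaces
    tokens = text.split()                     # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal


print(preprocess("The  price of Oil is 90 dollars, and rising!"))
```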
NLP Pipeline: Feature Engineering/Text Representation
• Feature Engineering for Classical ML
• Bag-of-words representations
• Domain-specific word frequency lists
• Handcrafted features based on domain-specific knowledge
• Feature Engineering for DL
• DL directly takes the texts as inputs to the model.
• The DL model is capable of learning features from the texts (e.g.,
embeddings)
• The price is that the model is often less interpretable.
NLP Pipeline: Bag of Words Model (Binary)
• The bag-of-words model is the simplest way (i.e., easy to automate)
to vectorize texts into binary representations.
NLP Pipeline: Bag of Words Model (Count)
• The bag-of-words model is the simplest way (i.e., easy to automate)
to vectorize texts into numeric representations.
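Both the binary and the count variants can be sketched in a few lines of plain Python. The two example documents and the function names `count_vector` and `binary_vector` are made up for illustration; libraries such as scikit-learn offer equivalent vectorizers, but none is assumed here.

```python
from collections import Counter

docs = [
    "the dog chased the cat",
    "the cat sat",
]

# Vocabulary: every distinct word in the corpus, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})


def count_vector(doc: str) -> list[int]:
    """Count representation: how often each vocabulary word occurs."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


def binary_vector(doc: str) -> list[int]:
    """Binary representation: 1 if the word occurs at all, else 0."""
    return [1 if count > 0 else 0 for count in count_vector(doc)]


print(vocab)
for doc in docs:
    print(count_vector(doc), binary_vector(doc))
```

Each document becomes a vector with one position per vocabulary word, so all documents share the same dimensionality regardless of their length.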
NLP Pipeline: Bag of Words Model
• Issues with Bag-of-Words Text Representation
• Word order is ignored.
• Raw absolute frequency counts of words do not necessarily represent the
meaning of the text properly.
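The word-order issue is easy to demonstrate: two sentences with opposite meanings get identical bag-of-words counts. The sentences below are made-up examples.

```python
from collections import Counter

# Same words, different order, different meaning.
a = Counter("dog bites man".split())
b = Counter("man bites dog".split())

# The bag-of-words counts cannot tell the two sentences apart.
print(a == b)
```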
NLP Pipeline: TF-IDF Model
• The TF-IDF model is an extension of the bag-of-words model, whose main
objective is to adjust the raw frequency counts by considering the
dispersion of the words in the corpus.
• Dispersion refers to how evenly each word/term is distributed across
the documents of the corpus.
• Interaction between raw word frequency counts and dispersion:
• Given a high-frequency word:
• If the word is widely dispersed across the documents of the corpus (i.e., high dispersion),
it is more likely to be semantically general.
• If the word is concentrated in a limited set of documents in the corpus (i.e., low
dispersion), it is more likely to be topic-specific.
• Dispersion rates of words can be used as weights for the importance of
word frequency counts.
NLP Pipeline: TF-IDF Model
TF-IDF = TF * IDF
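A small sketch of the product above, under stated assumptions: TF is taken as the raw count of a term in a document and IDF as log(N / df), where N is the number of documents and df is the number of documents containing the term. The slides do not say which variant is meant, and libraries often add smoothing or normalization, so treat this as one common choice. The three example documents are made up.

```python
import math
from collections import Counter

docs = [
    "the oil price rose".split(),
    "the oil market fell".split(),
    "the team won the match".split(),
]

N = len(docs)
# Document frequency: the number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))


def tf_idf(doc: list[str]) -> dict[str, float]:
    """Weight each term by raw count times log(N / document frequency)."""
    tf = Counter(doc)
    return {term: tf[term] * math.log(N / df[term]) for term in tf}


for doc in docs:
    print({term: round(weight, 3) for term, weight in tf_idf(doc).items()})
```

A word that appears in every document ("the") gets weight 0, while a word confined to one document ("price") gets the highest weight, matching the intuition about dispersion described above.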
NLP Pipeline: Modeling & Evaluation
• More details in Weeks 2, 3, and 6 of the program.
Further Resources
• Deep Learning for NLP in Python – DataCamp
https://learn.datacamp.com/skill-tracks/deep-learning-for-nlp-in-python
• Natural Language Processing Specialization – Coursera
https://www.coursera.org/specializations/natural-language-processing
• Speech and Language Processing – Book
https://web.stanford.edu/~jurafsky/slp3/
Any Questions?
Thank You

