SlideShare ist ein Scribd-Unternehmen logo
1 von 28
Downloaden Sie, um offline zu lesen
NLTK: Natural Language Toolkit
  Overview and Application
               Jimmy Lai
          jimmy.lai@oi-sys.com
     Software Engineer @ 引京聚點
              2012/06/09


                                 1
Outline
1.   An application based on NLP: 聚寶評
2.   Introduction to Natural Language Processing
3.   Brief History of NLTK
4.   NLTK




                                                   2
聚寶評 www.ezpao.com

      美食搜尋引擎




搜尋各大部落格食記

                    3
聚寶評 www.ezpao.com

     語意分析搜尋引擎




                    4
評論主題分析




網友分享菜分析




 正評/負評分析




                    5
手機版            網友分享菜+評論分析

m.ezpao.com




              網友分享菜+正負評分析




                            6
Natural Language Processing (NLP)
•    語音識別(Speech recognition)
•    詞性標註(Part-of-speech tagging)
•    句法分析(Parsing)
•    自然語言生成(Natural language generation)
•    文本分類(Text classification)
•    信息抽取(Information extraction)
•    機器翻譯(Machine translation)
•    文字蘊涵(Textual entailment)
    via Wikipedia

                                           7
NLTK: Natural Language Toolkit
• http://www.nltk.org/
• Author: Steven Bird, Edward Loper, Ewan Klein
• Originally developed for class student has
  background either in computer science or
  linguistics.
• Currently:
  – Education: over 100 courses in 23 countries.
  – Research: over 250 papers cites NLTK.

                                                   8
Outline
1.   An application based on NLP: 聚寶評
2.   Introduction to Natural Language Processing
3.   Brief History of NLTK
4.   NLTK:
     a. Installation
     b. Annotated Text Corpora
     c. Text Tokenization, Normalization, Analysis,
        Distribution Analysis
     d. Part-of-speech Tagging
     e. Text Classification
     f. Named Entity Recognition
                                                      9
Install NLTK
Python 2.6+

pip install numpy
pip install matplotlib
pip install nltk




                         10
Annotated Text Corpora (1/3)
           nltk.corpus
• Corpus: 語料庫,含有某種結構化標記的資
  料集合,可能包含多種語言。
• 例如:
 – stopwords: 常見字字典
 – sinica_treebank: 中文語句結構標記語料庫
 – brown: 包含15種分類及詞性標記的英語語料庫
 – wordnet: 包含詞性、同義反義的英語字典



                                  11
Annotated Text Corpora (2/3)
                   nltk.corpus
                                                #Chinese treebank
import nltk                                     nltk.corpus.sinica_treebank
#download corpus on demand
nltk.download()                                 #Examples
                                                (嘉珍, Nba)
#stopwords                                      (和, Caa)
nltk.corpus.stopwords.words()                   (我, Nhaa)
nltk.corpus.stopwords.words('english')          (住在, VC1)
nltk.corpus.stopwords.words('french')           (同一條, DM)
                                                (巷子, Nab)
some_english_stopwords = ['most', 'me',
'below', 'when', 'which', 'what', 'of', 'it',   (我們, Nhaa)
'very', 'our']                                  (是, V_11)
                                                (鄰居, Nab)

                                                (也, Dbb)
                                                (是, V_11)
                                                (同班, Nv3)
                                                (同學, Nab)                     12
Annotated Text Corpora (3/3)
                     nltk.corpus
ACE Named Entity Chunker (Maximum entropy)         Portuguese Treebank
Australian Broadcasting Commission 2006            Gazeteer Lists
                                                   Genesis Corpus
Alpino Dutch Treebank
                                                   Project Gutenberg Selections
BioCreAtIvE (Critical Assessment of Information    NIST IE-ER DATA SAMPLE
Extraction Systems in Biology)                     C-Span Inaugural Address Corpus
Brown Corpus                                       Indian Language POS-Tagged Corpus
Brown Corpus (TEI XML Version)                     JEITA Public Morphologically Tagged Corpus (in
CESS-CAT Treebank                                  ChaSen format)
                                                   PC-KIMMO Data Files
CESS-ESP Treebank
                                                   KNB Corpus (Annotated blog corpus)
Chat-80 Data Files                                 Language Id Corpus
City Database                                      Lin's Dependency Thesaurus
The Carnegie Mellon Pronouncing Dictionary (0.6)   MAC-MORPHO: Brazilian Portuguese news text with
ComTrans Corpus Sample                             part-of-speech tags
                                                   Machado de Assis -- Obra Completa
CONLL 2000 Chunking Corpus
                                                   Sentiment Polarity Dataset Version 2.0
CONLL 2002 Named Entity Recognition Corpus         Names Corpus, Version 1.3 (1994-03-29)
Dependency Treebanks from CoNLL 2007 (Catalan      NomBank Corpus 1.0
and Basque Subset)                                 NPS Chat
Dependency Parsed Treebank                         Paradigm Corpus
Sample European Parliament Proceedings Parallel
Corpus


                                                                                                     13
Text Tokenization
                      nltk.tokenize
Web Text Processing Flow
          HTML               ASCII             Text           Vocabulary

from urllib import urlopen
html = urlopen(url).read()
raw = nltk.clean_html(html)
sents = nltk.sent_tokenize(raw)
tokens = []                         # wordpunct_tokenize: ['3', '.', '33']
for sent in sents:                  # word_tokenize: ['3.33']
    tokens.extend(nltk.word_tokenize(sent))
text = nltk.Text(tokens)
words = [word.lower() for word in text]
vocab = sorted(set(words))

                                                                             14
Text Normalization (1/2)
           nltk.stem
• Stem: 將單字(現在式、過去式、單複數)還
  原成原型。可以將不同形式的單字歸類為
  同一個單字。
• 著名演算法:Porter Stemmer




                               15
Text Normalization (2/2)
       nltk.stem




                           16
Text Analysis
  nltk.text




                17
Text Distribution Analysis
     nltk.probability




                             18
Part-of-speech Tagging (1/3)
               nltk.tag
• Part of Speech Tagging: 詞性標記,
  標記每個單字的詞性。
• 同一單字的不同詞性其語義不同,
  如Book名詞是書,動詞是預定。
• 透過POS Tagging,可以賦予文字更
  多語義資訊。



                                    19
Part-of-speech Tagging (2/3)
          nltk.tag




                               20
Part-of-speech Tagging (3/3)
          nltk.tag




                               21
Text Classification (1/3)
              nltk.classify
• Text Classification: 文字分類,分析文字後將
  文字分到預先定義的類別裡。
• 基於統計的機器學習演算法,著名的演算
  法為:
 – Naïve Bayes Classifier
 – Decision Tree
 – Support Vector Machine



                                    22
Text Classification (2/3)
                      nltk.classify
Machine Learning Approach Work Flow
Training
    Text with      Feature                 Learning    Classifier
                               Features
      Label       Generation              Algorithms    Model



Prediction
      Text                                Classifier
                   Feature
    without                    Features                 Label
                  Generation               Model
     Label




                                                                    23
Text Classification (3/3)
      nltk.classify




                            24
Named Entity Recognition (NER) (1/2)
       nltk.tag, nltk.chunk
• Named Entity Recognition: 從文字中擷取出
  命名實體,命名實體是具有完整語義的複
  合單字。例如:人名、地名、事件。

 NER General Work Flow

  Sentence                                  Named Entity    Relation
               Tokenization   POS tagging
Segmentation                                 Recognition   Recognition




                                                                     25
Named Entity Recognition (NER) (2/2)
       nltk.tag, nltk.chunk




                                   26
Reference
1. Steven Bird, Ewan Klein, and Edward Loper,
   “Natural Language Processing with Python”,
   2009. #includes: Python + NLP + NLTK
2. Jacob Perkins, “Python Text Processing with
   NLTK 2.0 Cookbook”, 2010.
3. Matthew A. Russell, “Mining the Social Web”,
   2011.
4. Loper, E., & Bird, S. (2002). NLTK: The Natural
   Language Toolkit. Proceedings of the ACL02
   Workshop on Effective tools and methodologies
   for teaching natural language processing and
   computational linguistics.
                                                 27
Thank you for your attention.
           Q&A
We are hiring!
• 核心引擎演算法研發工程師
• 系統研發工程師
• 網路應用研發工程師

Oxygen Intelligence Taiwan Limited
引京聚點 知識結構搜索股份有限公司
• 公司簡介: http://www.ezpao.com/about/
• 職缺簡介: http://www.ezpao.com/join/
• 請將履歷寄到 jimmy.lai@oi-sys.com

                                      28

Weitere ähnliche Inhalte

Was ist angesagt?

Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - IJaganadh Gopinadhan
 
Basic NLP with Python and NLTK
Basic NLP with Python and NLTKBasic NLP with Python and NLTK
Basic NLP with Python and NLTKFrancesco Bruni
 
Sk t academy lecture note
Sk t academy lecture noteSk t academy lecture note
Sk t academy lecture noteSusang Kim
 
Text as Data: processing the Hebrew Bible
Text as Data: processing the Hebrew BibleText as Data: processing the Hebrew Bible
Text as Data: processing the Hebrew BibleDirk Roorda
 
基于 Google protobuf 的 webgame 网络协议设计
基于 Google protobuf 的 webgame 网络协议设计基于 Google protobuf 的 webgame 网络协议设计
基于 Google protobuf 的 webgame 网络协议设计勇浩 赖
 
Devoxx traitement automatique du langage sur du texte en 2019
Devoxx   traitement automatique du langage sur du texte en 2019 Devoxx   traitement automatique du langage sur du texte en 2019
Devoxx traitement automatique du langage sur du texte en 2019 Alexis Agahi
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout source{d}
 
Grokking the REST Architectural Style
Grokking the REST Architectural StyleGrokking the REST Architectural Style
Grokking the REST Architectural StyleBen Ramsey
 
You Look Like You Could Use Some REST!
You Look Like You Could Use Some REST!You Look Like You Could Use Some REST!
You Look Like You Could Use Some REST!Ben Ramsey
 
Natural Language Processing made easy
Natural Language Processing made easyNatural Language Processing made easy
Natural Language Processing made easyGopi Krishnan Nambiar
 
Your Own Metric System
Your Own Metric SystemYour Own Metric System
Your Own Metric SystemErin Dees
 
[Reactive] Programming with [Rx]ROS
[Reactive] Programming with [Rx]ROS[Reactive] Programming with [Rx]ROS
[Reactive] Programming with [Rx]ROSAndrzej Wasowski
 
Thnad's Revenge
Thnad's RevengeThnad's Revenge
Thnad's RevengeErin Dees
 
Natural Intelligence ICCS 2010
Natural Intelligence ICCS 2010Natural Intelligence ICCS 2010
Natural Intelligence ICCS 2010fmguler
 
2015 bioinformatics databases_wim_vancriekinge
2015 bioinformatics databases_wim_vancriekinge2015 bioinformatics databases_wim_vancriekinge
2015 bioinformatics databases_wim_vancriekingeProf. Wim Van Criekinge
 
How to make DSL
How to make DSLHow to make DSL
How to make DSLYukio Goto
 
Write Your Own JVM Compiler
Write Your Own JVM CompilerWrite Your Own JVM Compiler
Write Your Own JVM CompilerErin Dees
 

Was ist angesagt? (20)

Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - I
 
Basic NLP with Python and NLTK
Basic NLP with Python and NLTKBasic NLP with Python and NLTK
Basic NLP with Python and NLTK
 
Sk t academy lecture note
Sk t academy lecture noteSk t academy lecture note
Sk t academy lecture note
 
Text as Data: processing the Hebrew Bible
Text as Data: processing the Hebrew BibleText as Data: processing the Hebrew Bible
Text as Data: processing the Hebrew Bible
 
基于 Google protobuf 的 webgame 网络协议设计
基于 Google protobuf 的 webgame 网络协议设计基于 Google protobuf 的 webgame 网络协议设计
基于 Google protobuf 的 webgame 网络协议设计
 
Devoxx traitement automatique du langage sur du texte en 2019
Devoxx   traitement automatique du langage sur du texte en 2019 Devoxx   traitement automatique du langage sur du texte en 2019
Devoxx traitement automatique du langage sur du texte en 2019
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout
 
Grokking the REST Architectural Style
Grokking the REST Architectural StyleGrokking the REST Architectural Style
Grokking the REST Architectural Style
 
You Look Like You Could Use Some REST!
You Look Like You Could Use Some REST!You Look Like You Could Use Some REST!
You Look Like You Could Use Some REST!
 
Natural Language Processing made easy
Natural Language Processing made easyNatural Language Processing made easy
Natural Language Processing made easy
 
Your Own Metric System
Your Own Metric SystemYour Own Metric System
Your Own Metric System
 
Py jail talk
Py jail talkPy jail talk
Py jail talk
 
[Reactive] Programming with [Rx]ROS
[Reactive] Programming with [Rx]ROS[Reactive] Programming with [Rx]ROS
[Reactive] Programming with [Rx]ROS
 
Thnad's Revenge
Thnad's RevengeThnad's Revenge
Thnad's Revenge
 
2 P Seminar
2 P Seminar2 P Seminar
2 P Seminar
 
Flower and celery
Flower and celeryFlower and celery
Flower and celery
 
Natural Intelligence ICCS 2010
Natural Intelligence ICCS 2010Natural Intelligence ICCS 2010
Natural Intelligence ICCS 2010
 
2015 bioinformatics databases_wim_vancriekinge
2015 bioinformatics databases_wim_vancriekinge2015 bioinformatics databases_wim_vancriekinge
2015 bioinformatics databases_wim_vancriekinge
 
How to make DSL
How to make DSLHow to make DSL
How to make DSL
 
Write Your Own JVM Compiler
Write Your Own JVM CompilerWrite Your Own JVM Compiler
Write Your Own JVM Compiler
 

Andere mochten auch

Documentation with sphinx @ PyHug
Documentation with sphinx @ PyHugDocumentation with sphinx @ PyHug
Documentation with sphinx @ PyHugJimmy Lai
 
Data Analyst Nanodegree
Data Analyst NanodegreeData Analyst Nanodegree
Data Analyst NanodegreeJimmy Lai
 
Fast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython NotebookFast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython NotebookJimmy Lai
 
[LDSP] Solr Usage
[LDSP] Solr Usage[LDSP] Solr Usage
[LDSP] Solr UsageJimmy Lai
 
[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast Prototyping[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast PrototypingJimmy Lai
 
Software development practices in python
Software development practices in pythonSoftware development practices in python
Software development practices in pythonJimmy Lai
 
When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012Jimmy Lai
 
Apache thrift-RPC service cross languages
Apache thrift-RPC service cross languagesApache thrift-RPC service cross languages
Apache thrift-RPC service cross languagesJimmy Lai
 
Build a Searchable Knowledge Base
Build a Searchable Knowledge BaseBuild a Searchable Knowledge Base
Build a Searchable Knowledge BaseJimmy Lai
 
“Why Should I Trust You?” Explaining the Predictions of Any Classifierの紹介
“Why Should I Trust You?” Explaining the Predictions of Any Classifierの紹介“Why Should I Trust You?” Explaining the Predictions of Any Classifierの紹介
“Why Should I Trust You?” Explaining the Predictions of Any Classifierの紹介KCS Keio Computer Society
 
Statistical machine translation for indian language copy
Statistical machine translation for indian language   copyStatistical machine translation for indian language   copy
Statistical machine translation for indian language copyNakul Sharma
 
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Jimmy Lai
 
NetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHugNetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHugJimmy Lai
 
Text classification in scikit-learn
Text classification in scikit-learnText classification in scikit-learn
Text classification in scikit-learnJimmy Lai
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Jimmy Lai
 

Andere mochten auch (15)

Documentation with sphinx @ PyHug
Documentation with sphinx @ PyHugDocumentation with sphinx @ PyHug
Documentation with sphinx @ PyHug
 
Data Analyst Nanodegree
Data Analyst NanodegreeData Analyst Nanodegree
Data Analyst Nanodegree
 
Fast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython NotebookFast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython Notebook
 
[LDSP] Solr Usage
[LDSP] Solr Usage[LDSP] Solr Usage
[LDSP] Solr Usage
 
[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast Prototyping[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast Prototyping
 
Software development practices in python
Software development practices in pythonSoftware development practices in python
Software development practices in python
 
When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012
 
Apache thrift-RPC service cross languages
Apache thrift-RPC service cross languagesApache thrift-RPC service cross languages
Apache thrift-RPC service cross languages
 
Build a Searchable Knowledge Base
Build a Searchable Knowledge BaseBuild a Searchable Knowledge Base
Build a Searchable Knowledge Base
 
“Why Should I Trust You?” Explaining the Predictions of Any Classifierの紹介
“Why Should I Trust You?” Explaining the Predictions of Any Classifierの紹介“Why Should I Trust You?” Explaining the Predictions of Any Classifierの紹介
“Why Should I Trust You?” Explaining the Predictions of Any Classifierの紹介
 
Statistical machine translation for indian language copy
Statistical machine translation for indian language   copyStatistical machine translation for indian language   copy
Statistical machine translation for indian language copy
 
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
 
NetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHugNetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHug
 
Text classification in scikit-learn
Text classification in scikit-learnText classification in scikit-learn
Text classification in scikit-learn
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
 

Ähnlich wie Nltk natural language toolkit overview and application @ PyCon.tw 2012

Corpus annotation for corpus linguistics (nov2009)
Corpus annotation for corpus linguistics (nov2009)Corpus annotation for corpus linguistics (nov2009)
Corpus annotation for corpus linguistics (nov2009)Jorge Baptista
 
Open nlp presentationss
Open nlp presentationssOpen nlp presentationss
Open nlp presentationssChandan Deb
 
Large scale nlp using python's nltk on azure
Large scale nlp using python's nltk on azureLarge scale nlp using python's nltk on azure
Large scale nlp using python's nltk on azurecloudbeatsch
 
Assignment4.pptx
Assignment4.pptxAssignment4.pptx
Assignment4.pptxjatinchand3
 
DataFest 2017. Introduction to Natural Language Processing by Rudolf Eremyan
DataFest 2017. Introduction to Natural Language Processing by Rudolf EremyanDataFest 2017. Introduction to Natural Language Processing by Rudolf Eremyan
DataFest 2017. Introduction to Natural Language Processing by Rudolf Eremyanrudolf eremyan
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingSeonghyun Kim
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingNimrita Koul
 
Ai meetup Neural machine translation updated
Ai meetup Neural machine translation updatedAi meetup Neural machine translation updated
Ai meetup Neural machine translation updated2040.io
 
Challenge of Image Retrieval, Brighton, 2000 1 ANVIL: a System for the Retrie...
Challenge of Image Retrieval, Brighton, 2000 1 ANVIL: a System for the Retrie...Challenge of Image Retrieval, Brighton, 2000 1 ANVIL: a System for the Retrie...
Challenge of Image Retrieval, Brighton, 2000 1 ANVIL: a System for the Retrie...Petros Tsonis
 
Natural language processing (Python)
Natural language processing (Python)Natural language processing (Python)
Natural language processing (Python)Sumit Raj
 
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...Grammarly
 
Natural Language Processing_in semantic web.pptx
Natural Language Processing_in semantic web.pptxNatural Language Processing_in semantic web.pptx
Natural Language Processing_in semantic web.pptxAlyaaMachi
 
Corpus Linguistics :Analytical Tools
Corpus Linguistics :Analytical ToolsCorpus Linguistics :Analytical Tools
Corpus Linguistics :Analytical ToolsJitendra Patil
 
SoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming textSoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming textSujit Pal
 
Understanding Names with Neural Networks - May 2020
Understanding Names with Neural Networks - May 2020Understanding Names with Neural Networks - May 2020
Understanding Names with Neural Networks - May 2020Basis Technology
 
MOST (Newsfromthefront 2010)
MOST (Newsfromthefront 2010)MOST (Newsfromthefront 2010)
MOST (Newsfromthefront 2010)STI International
 

Ähnlich wie Nltk natural language toolkit overview and application @ PyCon.tw 2012 (20)

Corpus annotation for corpus linguistics (nov2009)
Corpus annotation for corpus linguistics (nov2009)Corpus annotation for corpus linguistics (nov2009)
Corpus annotation for corpus linguistics (nov2009)
 
Open nlp presentationss
Open nlp presentationssOpen nlp presentationss
Open nlp presentationss
 
Large scale nlp using python's nltk on azure
Large scale nlp using python's nltk on azureLarge scale nlp using python's nltk on azure
Large scale nlp using python's nltk on azure
 
Nltk
NltkNltk
Nltk
 
Assignment4.pptx
Assignment4.pptxAssignment4.pptx
Assignment4.pptx
 
DataFest 2017. Introduction to Natural Language Processing by Rudolf Eremyan
DataFest 2017. Introduction to Natural Language Processing by Rudolf EremyanDataFest 2017. Introduction to Natural Language Processing by Rudolf Eremyan
DataFest 2017. Introduction to Natural Language Processing by Rudolf Eremyan
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Ai meetup Neural machine translation updated
Ai meetup Neural machine translation updatedAi meetup Neural machine translation updated
Ai meetup Neural machine translation updated
 
Challenge of Image Retrieval, Brighton, 2000 1 ANVIL: a System for the Retrie...
Challenge of Image Retrieval, Brighton, 2000 1 ANVIL: a System for the Retrie...Challenge of Image Retrieval, Brighton, 2000 1 ANVIL: a System for the Retrie...
Challenge of Image Retrieval, Brighton, 2000 1 ANVIL: a System for the Retrie...
 
Natural language processing (Python)
Natural language processing (Python)Natural language processing (Python)
Natural language processing (Python)
 
Arrays in database systems, the next frontier?
Arrays in database systems, the next frontier?Arrays in database systems, the next frontier?
Arrays in database systems, the next frontier?
 
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...
 
Natural Language Processing_in semantic web.pptx
Natural Language Processing_in semantic web.pptxNatural Language Processing_in semantic web.pptx
Natural Language Processing_in semantic web.pptx
 
Corpus Linguistics :Analytical Tools
Corpus Linguistics :Analytical ToolsCorpus Linguistics :Analytical Tools
Corpus Linguistics :Analytical Tools
 
Intro to KotlinNLP
Intro to KotlinNLPIntro to KotlinNLP
Intro to KotlinNLP
 
Introduction to KotlinNLP
Introduction to KotlinNLPIntroduction to KotlinNLP
Introduction to KotlinNLP
 
SoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming textSoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming text
 
Understanding Names with Neural Networks - May 2020
Understanding Names with Neural Networks - May 2020Understanding Names with Neural Networks - May 2020
Understanding Names with Neural Networks - May 2020
 
MOST (Newsfromthefront 2010)
MOST (Newsfromthefront 2010)MOST (Newsfromthefront 2010)
MOST (Newsfromthefront 2010)
 

Mehr von Jimmy Lai

Python Linters at Scale.pdf
Python Linters at Scale.pdfPython Linters at Scale.pdf
Python Linters at Scale.pdfJimmy Lai
 
EuroPython 2022 - Automated Refactoring Large Python Codebases
EuroPython 2022 - Automated Refactoring Large Python CodebasesEuroPython 2022 - Automated Refactoring Large Python Codebases
EuroPython 2022 - Automated Refactoring Large Python CodebasesJimmy Lai
 
Annotate types in large codebase with automated refactoring
Annotate types in large codebase with automated refactoringAnnotate types in large codebase with automated refactoring
Annotate types in large codebase with automated refactoringJimmy Lai
 
The journey of asyncio adoption in instagram
The journey of asyncio adoption in instagramThe journey of asyncio adoption in instagram
The journey of asyncio adoption in instagramJimmy Lai
 
Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...Jimmy Lai
 
Continuous Delivery: automated testing, continuous integration and continuous...
Continuous Delivery: automated testing, continuous integration and continuous...Continuous Delivery: automated testing, continuous integration and continuous...
Continuous Delivery: automated testing, continuous integration and continuous...Jimmy Lai
 

Mehr von Jimmy Lai (6)

Python Linters at Scale.pdf
Python Linters at Scale.pdfPython Linters at Scale.pdf
Python Linters at Scale.pdf
 
EuroPython 2022 - Automated Refactoring Large Python Codebases
EuroPython 2022 - Automated Refactoring Large Python CodebasesEuroPython 2022 - Automated Refactoring Large Python Codebases
EuroPython 2022 - Automated Refactoring Large Python Codebases
 
Annotate types in large codebase with automated refactoring
Annotate types in large codebase with automated refactoringAnnotate types in large codebase with automated refactoring
Annotate types in large codebase with automated refactoring
 
The journey of asyncio adoption in instagram
The journey of asyncio adoption in instagramThe journey of asyncio adoption in instagram
The journey of asyncio adoption in instagram
 
Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...
 
Continuous Delivery: automated testing, continuous integration and continuous...
Continuous Delivery: automated testing, continuous integration and continuous...Continuous Delivery: automated testing, continuous integration and continuous...
Continuous Delivery: automated testing, continuous integration and continuous...
 

Kürzlich hochgeladen

Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Français Patch Tuesday - Avril
Français Patch Tuesday - AvrilFrançais Patch Tuesday - Avril
Français Patch Tuesday - AvrilIvanti
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Mark Simos
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Nikki Chapple
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 

Kürzlich hochgeladen (20)

Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Français Patch Tuesday - Avril
Français Patch Tuesday - AvrilFrançais Patch Tuesday - Avril
Français Patch Tuesday - Avril
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 

Nltk natural language toolkit overview and application @ PyCon.tw 2012

  • 1. NLTK: Natural Language Toolkit Overview and Application Jimmy Lai jimmy.lai@oi-sys.com Software Engineer @ 引京聚點 2012/06/09 1
  • 2. Outline 1. An application based on NLP: 聚寶評 2. Introduction to Natural Language Processing 3. Brief History of NLTK 4. NLTK 2
  • 3. 聚寶評 www.ezpao.com 美食搜尋引擎 搜尋各大部落格食記 3
  • 4. 聚寶評 www.ezpao.com 語意分析搜尋引擎 4
  • 6. 手機版 網友分享菜+評論分析 m.ezpao.com 網友分享菜+正負評分析 6
  • 7. Natural Language Processing (NLP) • 語音識別(Speech recognition) • 詞性標註(Part-of-speech tagging) • 句法分析(Parsing) • 自然語言生成(Natural language generation) • 文本分類(Text classification) • 信息抽取(Information extraction) • 機器翻譯(Machine translation) • 文字蘊涵(Textual entailment) via Wikipedia 7
  • 8. NLTK: Natural Language Toolkit • http://www.nltk.org/ • Author: Steven Bird, Edward Loper, Ewan Klein • Originally developed for class student has background either in computer science or linguistics. • Currently: – Education: over 100 courses in 23 countries. – Research: over 250 papers cites NLTK. 8
  • 9. Outline 1. An application based on NLP: 聚寶評 2. Introduction to Natural Language Processing 3. Brief History of NLTK 4. NLTK: a. Installation b. Annotated Text Corpora c. Text Tokenization, Normalization, Analysis, Distribution Analysis d. Part-of-speech Tagging e. Text Classification f. Named Entity Recognition 9
  • 10. Install NLTK Python 2.6+ pip install numpy pip install matplotlib pip install nltk 10
  • 11. Annotated Text Corpora (1/3) nltk.corpus • Corpus: 語料庫,含有某種結構化標記的資 料集合,可能包含多種語言。 • 例如: – stopwords: 常見字字典 – sinica_treebank: 中文語句結構標記語料庫 – brown: 包含15種分類及詞性標記的英語語料庫 – wordnet: 包含詞性、同義反義的英語字典 11
  • 12. Annotated Text Corpora (2/3) nltk.corpus #Chinese treebank import nltk nltk.corpus.sinica_treebank #download corpus on demand nltk.download() #Examples (嘉珍, Nba) #stopwords (和, Caa) nltk.corpus.stopwords.words() (我, Nhaa) nltk.corpus.stopwords.words('english') (住在, VC1) nltk.corpus.stopwords.words('french') (同一條, DM) (巷子, Nab) some_english_stopwords = ['most', 'me', 'below', 'when', 'which', 'what', 'of', 'it', (我們, Nhaa) 'very', 'our'] (是, V_11) (鄰居, Nab) (也, Dbb) (是, V_11) (同班, Nv3) (同學, Nab) 12
  • 13. Annotated Text Corpora (3/3) nltk.corpus ACE Named Entity Chunker (Maximum entropy) Portuguese Treebank Australian Broadcasting Commission 2006 Gazeteer Lists Genesis Corpus Alpino Dutch Treebank Project Gutenberg Selections BioCreAtIvE (Critical Assessment of Information NIST IE-ER DATA SAMPLE Extraction Systems in Biology) C-Span Inaugural Address Corpus Brown Corpus Indian Language POS-Tagged Corpus Brown Corpus (TEI XML Version) JEITA Public Morphologically Tagged Corpus (in CESS-CAT Treebank ChaSen format) PC-KIMMO Data Files CESS-ESP Treebank KNB Corpus (Annotated blog corpus) Chat-80 Data Files Language Id Corpus City Database Lin's Dependency Thesaurus The Carnegie Mellon Pronouncing Dictionary (0.6) MAC-MORPHO: Brazilian Portuguese news text with ComTrans Corpus Sample part-of-speech tags Machado de Assis -- Obra Completa CONLL 2000 Chunking Corpus Sentiment Polarity Dataset Version 2.0 CONLL 2002 Named Entity Recognition Corpus Names Corpus, Version 1.3 (1994-03-29) Dependency Treebanks from CoNLL 2007 (Catalan NomBank Corpus 1.0 and Basque Subset) NPS Chat Dependency Parsed Treebank Paradigm Corpus Sample European Parliament Proceedings Parallel Corpus 13
  • 14. Text Tokenization nltk.tokenize Web Text Processing Flow HTML ASCII Text Vocabulary from urllib import urlopen html = urlopen(url).read() raw = nltk.clean_html(html) sents = nltk.sent_tokenize(raw) tokens = [] # wordpunct_tokenize: ['3', '.', '33'] for sent in sents: # word_tokenize: ['3.33'] tokens.extend(nltk.word_tokenize(sent)) text = nltk.Text(tokens) words = [word.lower() for word in text] vocab = sorted(set(words)) 14
  • 15. Text Normalization (1/2) nltk.stem • Stem: 將單字(現在式、過去式、單複數)還 原成原型。可以將不同形式的單字歸類為 同一個單字。 • 著名演算法:Porter Stemmer 15
  • 16. Text Normalization (2/2) nltk.stem 16
  • 17. Text Analysis nltk.text 17
  • 18. Text Distribution Analysis nltk.probability 18
  • 19. Part-of-speech Tagging (1/3) nltk.tag • Part of Speech Tagging: 詞性標記, 標記每個單字的詞性。 • 同一單字的不同詞性其語義不同, 如Book名詞是書,動詞是預定。 • 透過POS Tagging,可以賦予文字更 多語義資訊。 19
  • 22. Text Classification (1/3) nltk.classify • Text Classification: 文字分類,分析文字後將 文字分到預先定義的類別裡。 • 基於統計的機器學習演算法,著名的演算 法為: – Naïve Bayes Classifier – Decision Tree – Support Vector Machine 22
  • 23. Text Classification (2/3) nltk.classify Machine Learning Approach Work Flow Training Text with Feature Learning Classifier Features Label Generation Algorithms Model Prediction Text Classifier Feature without Features Label Generation Model Label 23
  • 24. Text Classification (3/3) nltk.classify 24
  • 25. Named Entity Recognition (NER) (1/2) nltk.tag, nltk.chunk • Named Entity Recognition: 從文字中擷取出 命名實體,命名實體是具有完整語義的複 合單字。例如:人名、地名、事件。 NER General Work Flow Sentence Named Entity Relation Tokenization POS tagging Segmentation Recognition Recognition 25
  • 26. Named Entity Recognition (NER) (2/2) nltk.tag, nltk.chunk 26
  • 27. Reference 1. Steven Bird, Ewan Klein, and Edward Loper, “Natural Language Processing with Python”, 2009. #includes: Python + NLP + NLTK 2. Jacob Perkins, “Python Text Processing with NLTK 2.0 Cookbook”, 2010. 3. Matthew A. Russell, “Mining the Social Web”, 2011. 4. Loper, E., & Bird, S. (2002). NLTK: The Natural Language Toolkit. Proceedings of the ACL02 Workshop on Effective tools and methodologies for teaching natural language processing and computational linguistics. 27
  • 28. Thank you for your attention. Q&A We are hiring! • 核心引擎演算法研發工程師 • 系統研發工程師 • 網路應用研發工程師 Oxygen Intelligence Taiwan Limited 引京聚點 知識結構搜索股份有限公司 • 公司簡介: http://www.ezpao.com/about/ • 職缺簡介: http://www.ezpao.com/join/ • 請將履歷寄到 jimmy.lai@oi-sys.com 28