2. Jacob Perkins
• Python Text Processing with NLTK 2.0 Cookbook
• streamhacker.com
• weotta.com
• text-processing.com
• @japerk
3. The Good
• Makes NLProc easier and more accessible
• Python (great learning language)
• Lots of documentation (and 2 books!)
• Designed for training custom models
• Includes many training corpora
• Many algorithms to experiment with
4. The Bad
• NLProc is hard
• Few out-of-the-box solutions (see Pattern)
• Not designed for big data (see Mahout)
• Doesn’t have latest algorithms (see scikit-learn)
• No online or active learning algorithms
5. More Bad
• Doesn’t play nice with pip or easy_install
• Python (Java: StanfordNLP, OpenNLP, GATE, Mahout)
• Models can use a lot of memory (& disk if pickled)
6. The Awesome
• Great for education and research
• Lots of users & active community
• Extensible interfaces
• Training algorithms span human languages
7. More Awesome
• Trained models can be very fast
• Well known algorithms can be very accurate
• NLTK-Trainer (train models with 0 code)
• Corpus bootstrapping
8. Some Numbers
• 3 Classification Algorithms
• 9 Part-of-Speech Tagging Algorithms
• Stemming Algorithms for 15 Languages
• 5 Word Tokenization Algorithms
• Sentence Tokenizers for 16 Languages
• 60 included corpora
15. Training Chunkers
• train_chunker.py treebank_chunk
• train_chunker.py treebank_chunk --classifier NaiveBayes
• train_chunker.py conll2000 --fileids train.txt
• Pickled models are saved in ~/nltk_data/chunkers/
16. Corpus Bootstrapping
• Guess & correct is easier than starting from scratch
• Use an existing model for initial guesses
• emoticons
‣ :) = “pos”
‣ :( = “neg”
• ratings
‣ 5 stars = “pos”
‣ 1 star = “neg”
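The emoticon and rating heuristics above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names and the (text, stars) input shape are hypothetical, not part of NLTK or NLTK-Trainer, and the guessed labels are meant to be corrected by hand afterwards.

```python
def label_by_emoticon(text):
    """Guess a label from emoticons; return None when there is no clear signal."""
    has_pos = ":)" in text
    has_neg = ":(" in text
    if has_pos and not has_neg:
        return "pos"
    if has_neg and not has_pos:
        return "neg"
    return None  # no emoticon, or conflicting emoticons


def label_by_rating(stars):
    """Guess a label from a star rating; only the extremes are trusted."""
    if stars == 5:
        return "pos"
    if stars == 1:
        return "neg"
    return None


def bootstrap_corpus(reviews):
    """Build an initial labeled corpus from (text, stars) pairs.

    `stars` may be None when no rating is available (assumed input shape).
    Items with no confident guess are skipped rather than labeled.
    """
    labeled = []
    for text, stars in reviews:
        label = label_by_rating(stars) if stars is not None else None
        if label is None:
            label = label_by_emoticon(text)
        if label is not None:
            labeled.append((text, label))
    return labeled


reviews = [
    ("Best pizza in town", 5),
    ("Cold food and rude staff", 1),
    ("It was fine", 3),
    ("Loved the patio :)", None),
    ("Never again :(", None),
]
corpus = bootstrap_corpus(reviews)
```

Here `corpus` keeps only the four items with a confident guess; the 3-star review is left out for manual labeling.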
17. Portuguese Phrase Extraction & Classification
• similar to condensr.com
• Brazilian Portuguese
• aspect classification is easy with training corpus
• need chunked corpus for phrase extraction
• use mac_morpho & nltk-trainer to train initial tagger
• part-of-speech tag annotation is time consuming
• simplified tags are much easier
• bracketed phrases without part-of-speech tags
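As a rough sketch of the last point, bracketed phrases with no part-of-speech tags can be turned into per-token IOB chunk tags using only the standard library. The square-bracket notation, the function name, and the generic "PHRASE" label are assumptions made for illustration; the slides do not specify the annotation format actually used.

```python
import re

# Matches either a bracketed phrase or a run of text outside brackets.
_SEGMENT = re.compile(r"\[([^\[\]]*)\]|([^\[\]]+)")


def bracketed_to_iob(line, label="PHRASE"):
    """Convert '[a comida] estava [muito boa]' into (token, IOB tag) pairs.

    Tokens inside brackets get B-/I- tags; everything else gets 'O'.
    Tokenization is plain whitespace splitting (a simplifying assumption).
    """
    pairs = []
    for match in _SEGMENT.finditer(line):
        phrase, outside = match.group(1), match.group(2)
        if phrase is not None:
            for i, token in enumerate(phrase.split()):
                prefix = "B-" if i == 0 else "I-"
                pairs.append((token, prefix + label))
        else:
            for token in outside.split():
                pairs.append((token, "O"))
    return pairs


pairs = bracketed_to_iob("[a comida] estava [muito boa]")
```

Annotators only mark phrase boundaries; word-level tags for the same text could then come from a tagger trained separately, such as the mac_morpho one mentioned above.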