Knowledge Representation
in
Digital Humanities
Antonio Jiménez Mavillard
Department of Modern Languages and Literatures
Western University
Lecture 9
Knowledge Representation in Digital Humanities
Antonio Jiménez Mavillard
* Contents:
1. Why this lecture?
2. Discussion
3. Chapter 9
4. Assignment
5. Bibliography
Why this lecture?
* This lecture...
· teaches some NLP techniques that can
be applied to real problems
· presents another example of how DH brings
together various disciplines (Linguistics,
Artificial Intelligence, Information
Science, Statistics...) to solve problems
Last assignment discussion
* Time to...
· consolidate ideas and
concepts dealt with in the readings
Chapter 9
Natural Language Processing
in Python
1. Preliminary theory
2. Word tagging and categorization
3. Text classification
4. Text information extraction
Chapter 9
1 Preliminary theory
1.1 Linguistics
1.2 Statistics
1.3 Artificial Intelligence
Chapter 9
2 Word tagging and categorization
2.1 Tagger
2.2 Automatic tagging
2.3 n-gram tagging
Chapter 9
3 Text classification
3.1 Supervised classification
3.2 Document classification
Chapter 9
4 Text information extraction
4.1 Information extraction
4.2 Entity recognition
4.3 Relation extraction
Preliminary theory
Linguistics
* Lexical categories
· nouns: people, places, things, concepts
· verbs: actions
· adjectives: describe nouns
· adverbs: modify adjectives and verbs
· ...
Linguistics
* These word classes are also known as
parts of speech
* They arise from simple analysis of the
distribution of words in text
Statistics
* Frequency distribution
· Arrangement of the values that one or
more variables take in a sample
Statistics
* Frequency distribution
· Example: vocabulary in a text
+ how many times each word appears in
the text?
+ it is a “distribution” since it tells us
how the total number of word tokens
in the text is distributed across the
vocabulary items
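The vocabulary example can be sketched with the standard library alone. This is a minimal illustration, not NLTK's FreqDist; the toy sentence and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

# Toy text; splitting on whitespace is a simplification for illustration.
text = "the cat sat on the mat and the dog sat too"
tokens = text.split()

# A frequency distribution maps each vocabulary item to its token count.
fdist = Counter(tokens)

total_tokens = sum(fdist.values())   # number of word tokens in the text
vocabulary_size = len(fdist)         # number of distinct vocabulary items
most_common = fdist.most_common(2)   # the two most frequent words
```

The counts always add up to the number of tokens, which is what makes this a distribution over the vocabulary.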
Statistics
* Conditional frequency distribution
· A collection of frequency distributions,
each one for a different condition
Statistics
* Conditional frequency distribution
· Example: vocabulary in a text
+ when the texts of a corpus are
divided into several categories we can
maintain separate frequency
distributions for each category
+ the condition will often be the
category of the text
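A conditional frequency distribution can be sketched as one counter per condition. The mini-corpus below is hypothetical and only stands in for a categorized corpus; NLTK's ConditionalFreqDist offers the same idea with more features.

```python
from collections import Counter, defaultdict

# Hypothetical mini-corpus: (category, text) pairs invented for illustration.
corpus = [
    ("news", "the council will vote on the budget"),
    ("news", "the mayor could veto the vote"),
    ("romance", "she could not hide the smile"),
]

# One frequency distribution per condition (here, the category of the text).
cfd = defaultdict(Counter)
for category, text in corpus:
    cfd[category].update(text.split())

news_the = cfd["news"]["the"]            # count of "the" in news texts
romance_could = cfd["romance"]["could"]  # count of "could" in romance texts
```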
Artificial Intelligence
* Supervised vs unsupervised learning
· Supervised learning:
+ Possible results are known
+ Data is labeled
· Unsupervised learning:
+ Results are unknown
+ Data is clustered
Artificial Intelligence
* Decision trees
· Flowchart that selects labels for input
values
· Formed by decision and leaf nodes
· Decision nodes: check feature values
· Leaf nodes: assign labels
Artificial Intelligence
* Decision trees
· Example: “Going out?”
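The figure for this example is not reproduced here, so the following is only a sketch of what such a tree could look like. The feature names (raining, have_umbrella, busy) and the labels are assumptions chosen for illustration.

```python
def going_out(features):
    """Walk a tiny hand-written decision tree and return a label."""
    # Decision node: check the 'raining' feature.
    if features["raining"]:
        # Decision node: check the 'have_umbrella' feature.
        if features["have_umbrella"]:
            return "go out"      # leaf node
        return "stay home"       # leaf node
    # Decision node: check the 'busy' feature.
    if features["busy"]:
        return "stay home"       # leaf node
    return "go out"              # leaf node


label = going_out({"raining": True, "have_umbrella": False, "busy": False})
```

Each if statement plays the role of a decision node and each return statement the role of a leaf node.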
Artificial Intelligence
* Naive Bayes classifiers
1. Begins by calculating the prior
probability of each label, determined by
checking the frequency of each label in
the training set
Artificial Intelligence
* Naive Bayes classifiers
2. The contribution from each feature is
combined with this prior probability, to
arrive at a likelihood estimate for each
label
3. The label whose likelihood estimate is
the highest is then assigned to the input
value
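The three steps can be sketched in plain Python. This is a simplified illustration, not NLTK's NaiveBayesClassifier: the training set is hypothetical, and add-one smoothing is an assumption made so that unseen feature values do not produce zero probabilities.

```python
from collections import Counter, defaultdict
import math

# Hypothetical training set: (features, label) pairs invented for illustration.
train = [
    ({"last_letter": "a"}, "female"),
    ({"last_letter": "a"}, "female"),
    ({"last_letter": "e"}, "female"),
    ({"last_letter": "k"}, "male"),
    ({"last_letter": "o"}, "male"),
]

label_counts = Counter(label for _, label in train)
feature_counts = defaultdict(Counter)   # label -> counts of (name, value) pairs
values_seen = defaultdict(set)          # feature name -> values seen in training
for features, label in train:
    for name, value in features.items():
        feature_counts[label][(name, value)] += 1
        values_seen[name].add(value)


def classify(features):
    scores = {}
    for label, count in label_counts.items():
        # Step 1: prior probability = relative frequency of the label.
        score = math.log(count / len(train))
        # Step 2: combine the contribution of each feature with the prior
        # (add-one smoothing is an assumption of this sketch).
        for name, value in features.items():
            numerator = feature_counts[label][(name, value)] + 1
            denominator = count + len(values_seen[name])
            score += math.log(numerator / denominator)
        scores[label] = score
    # Step 3: assign the label with the highest estimate.
    return max(scores, key=scores.get)
```

Working with logarithms turns the product of probabilities into a sum, which avoids numerical underflow when there are many features.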
Artificial Intelligence
* Naive Bayes classifiers
· Example: document classification
Prior probability: close to “Automotive”
References
“Frequency Distribution.” Wikipedia, the free encyclopedia 7 Apr. 2014. Wikipedia. Web. 8 Apr. 2014.
Mitchell, Tom M. “Chapter 3: Decision Tree Learning.” Machine Learning. New York: McGraw-Hill, 1997. Print.
Mitchell, Tom M. “Chapter 6: Bayesian Learning.” Machine Learning. New York: McGraw-Hill, 1997. Print.
“Part of Speech.” Wikipedia, the free encyclopedia 5 Apr. 2014. Wikipedia. Web. 8 Apr. 2014.
Steven Bird, Ewan Klein, and Edward Loper. “Conditional Frequency Distributions.” Natural Language Processing with
Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.
Steven Bird, Ewan Klein, and Edward Loper. “Frequency Distributions.” Natural Language Processing with Python. O’Reilly
Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.
Word tagging and categorization
Tagger
* Processes a sequence of words, and
attaches a part of speech tag to each
word
* Procedure:
1. Tokenization
2. Tagging
Tagger
* Example 1:
In [1]: import nltk
In [2]: text = 'And now for something completely different'
In [3]: tokens = nltk.word_tokenize(text)
In [4]: nltk.pos_tag(tokens)
Out[4]: 
[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]
Tagger
* Example 2:
In [1]: text = 'They refuse to permit us to obtain the refuse permit'
In [2]: tokens = nltk.word_tokenize(text)
In [3]: nltk.pos_tag(tokens)
Out[3]: 
[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]
Automatic tagging
* The tag of a word depends on the word
itself and its context within a sentence
* Working with data at the level of tagged
sentences rather than tagged words
Automatic tagging
* Loading data
· Example: tagged and non-tagged
sentences of “news” category
In [1]: from nltk.corpus import brown
In [2]: brown_tagged_sents = brown.tagged_sents(categories='news')
In [3]: brown_sents = brown.sents(categories='news')
Automatic tagging
* Default tagger
· Choose the most likely tag
In [4]: tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
   ...: nltk.FreqDist(tags).max()
Out[4]: 'NN'
Automatic tagging
* Default tagger
· Assign the most likely tag to each token
In [5]: text = 'I do not like green eggs and ham, I do not like them Sam I am!'
In [6]: tokens = nltk.word_tokenize(text)
In [7]: default_tagger = nltk.DefaultTagger('NN')
Automatic tagging
* Default tagger
In [8]: default_tagger.tag(tokens)
Out[8]: 
[('I', 'NN'),
 ('do', 'NN'),
 ('not', 'NN'),
 ('like', 'NN'),
 ('green', 'NN'),
 ('eggs', 'NN'),
 ('and', 'NN'),
 ('ham', 'NN'),
 (',', 'NN'),
 ('I', 'NN'),
 ('do', 'NN'),
 ('not', 'NN'),
 ('like', 'NN'),
 ('them', 'NN'),
 ('Sam', 'NN'),
 ('I', 'NN'),
 ('am', 'NN'),
 ('!', 'NN')]
Automatic tagging
* Default tagger
· This method performs rather poorly
· Unknown words will be nouns (as it
happens, most new words are nouns)
In [9]: default_tagger.evaluate(brown_tagged_sents)
Out[9]: 0.13089484257215028
Automatic tagging
* Regular expression tagger
· Assigns tags to tokens on the basis of
matching patterns
In [10]: patterns = [
   ...: (r'.*ing$', 'VBG'),                # gerunds
   ...: (r'.*ed$', 'VBD'),                 # simple past
   ...: (r'.*es$', 'VBZ'),                 # 3rd sing present
   ...: (r'.*ould$', 'MD'),                # modals
   ...: (r".*'s$", 'NN$'),                 # possessive nouns
   ...: (r'.*s$', 'NNS'),                  # plural nouns
   ...: (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
   ...: (r'.*', 'NN'),                     # nouns (default)
   ...: ]
In [11]: regexp_tagger = nltk.RegexpTagger(patterns)
Automatic tagging
* Regular expression tagger
In [12]: regexp_tagger.tag(brown_sents[3])
Out[12]: 
[('``', 'NN'),
 ('Only', 'NN'),
 ('a', 'NN'),
 ('relative', 'NN'),
 ('handful', 'NN'),
 ('of', 'NN'),
 ('such', 'NN'),
 ('reports', 'NNS'),
 ('was', 'NNS'),
 ('received', 'VBD'),
 ...]
Automatic tagging
* Regular expression tagger
· This method is correct about a fifth of
the time
· The final regular expression «.*» is a
catch-all that tags everything as a noun
In [13]: regexp_tagger.evaluate(brown_tagged_sents)
Out[13]: 0.20326391789486245
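The low score follows from how the patterns are applied: they are tried in order and the first match wins. The sketch below reproduces that behaviour with the standard re module only; it uses three of the rules and is an illustration, not NLTK's RegexpTagger.

```python
import re

# Three of the rules, tried in order; the first match wins.
patterns = [
    (r".*ed$", "VBD"),   # simple past
    (r".*s$", "NNS"),    # plural nouns
    (r".*", "NN"),       # catch-all: everything else is tagged as a noun
]


def regexp_tag(tokens):
    tagged = []
    for token in tokens:
        for pattern, tag in patterns:
            if re.match(pattern, token):
                tagged.append((token, tag))
                break
    return tagged


result = regexp_tag(["reports", "was", "received", "Only"])
```

Because the rules look only at the surface form, 'was' ends in "s" and is tagged as a plural noun, the same mistake visible in the tagged sentence from the news category.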
Automatic tagging
* Lookup tagger
· Problem: a lot of high-frequency words
do not have the NN tag
Automatic tagging
* Lookup tagger
· Solution:
+ Find the hundred most frequent words
and store their most likely tag
+ Use this information as model for a
lookup tagger (NLTK UnigramTagger)
+ Tag everything else as a noun
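The three steps of the solution can be sketched without NLTK. The tagged words are hypothetical, and the model keeps only the 2 most frequent words instead of 100 so that the example stays readable.

```python
from collections import Counter, defaultdict

# Hypothetical tagged words, invented for illustration (Brown-style tags).
tagged_words = [("the", "AT"), ("dog", "NN"), ("the", "AT"), ("run", "VB"),
                ("the", "AT"), ("run", "NN"), ("run", "VB"), ("cat", "NN")]

word_freq = Counter(word for word, _ in tagged_words)
tags_by_word = defaultdict(Counter)
for word, tag in tagged_words:
    tags_by_word[word][tag] += 1

# Model: the most frequent words mapped to their most likely tag.
model = {word: tags_by_word[word].most_common(1)[0][0]
         for word, _ in word_freq.most_common(2)}


def lookup_tag(tokens):
    # Words outside the model fall back to the noun tag.
    return [(token, model.get(token, "NN")) for token in tokens]


result = lookup_tag(["the", "run", "quickly"])
```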
Automatic tagging
* Lookup tagger
In [14]: fd = nltk.FreqDist(brown.words(categories='news'))
In [15]: # for each word, count how many times it occurs with each tag
    ...: cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
In [16]: # most_common() replaces the frequency-sorted keys() of older NLTK versions
    ...: most_freq_words = [word for (word, count) in fd.most_common(100)]
In [17]: # for each frequent word, keep its most frequent tag
    ...: likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
In [18]: baseline_tagger = nltk.UnigramTagger(model=likely_tags,
    ...:                                      backoff=nltk.DefaultTagger('NN'))
In [19]: baseline_tagger.evaluate(brown_tagged_sents)
Out[19]: 0.5817769556656125
Automatic tagging
* Lookup tagger
· The tagger accuracy increases as
the model size grows
n-gram tagging
* Unigram tagger
· Like the lookup tagger, it assigns the
most likely tag to each token
· Unlike the default tagger, it must be
trained before it can be used
· Training: initialize the tagger with
tagged sentence data as a parameter
n-gram tagging
* Unigram tagger
· Split the data into:
+ Training data (90%)
+ Testing data (10%)
n-gram tagging
* Unigram tagger
In [20]: size = int(len(brown_tagged_sents) * 0.9)
In [21]: train_sents = brown_tagged_sents[:size]
In [22]: test_sents = brown_tagged_sents[size:]
In [23]: unigram_tagger = nltk.UnigramTagger(train_sents)
n-gram tagging
* Unigram tagger
In [24]: unigram_tagger.tag(brown_sents[2007])
Out[24]: 
[('Various', 'JJ'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('apartments', 'NNS'),
 ('are', 'BER'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('terrace', 'NN'),
 ('type', 'NN'),
 (',', ','),
 ('being', 'BEG'),
 ('on', 'IN'),
 ('the', 'AT'),
 ('ground', 'NN'),
 ('floor', 'NN'),
 ('so', 'QL'),
 ('that', 'CS'),
 ('entrance', 'NN'),
 ('is', 'BEZ'),
 ('direct', 'JJ'),
 ('.', '.')]
In [21]: unigram_tagger.evaluate(test_sents)
Out[21]: 0.8110236220472441
n-gram tagging
* An n-gram tagger picks the tag that is
most likely in the given context
* Unigram (1-gram) tagger
· Context:
+ current token in isolation
n-gram tagging
* Bigram (2-gram) tagger
· Context:
+ current token
+ POS tag of the 1 preceding token
* Trigram (3-gram) tagger
· Context:
+ current token
+ POS tag of the 2 preceding tokens
n-gram tagging
* n-gram tagger
· Context:
+ current token
+ POS tag of the n-1 preceding tokens
n-gram tagging
* n-gram tagger
· Example: bigram
In [22]: bigram_tagger = nltk.BigramTagger(train_sents)
In [23]: bigram_tagger.evaluate(train_sents)
Out[23]: 0.7853094861965731
In [24]: bigram_tagger.evaluate(test_sents)
Out[24]: 0.10216286255357321
n-gram tagging
* n-gram tagger
· Example: bigram
+ Problem: it manages to tag words in
sentences of training data but
- it is unable to tag a new word
(assigns None)
n-gram tagging
* n-gram tagger
· Example: bigram
+ Problem: it manages to tag words in
sentences of training data but
- it also cannot tag the word that
follows (even if that word is not
new), because during training it
never saw that word preceded by
a None tag
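The cascade of None tags can be reproduced with a toy bigram tagger. This sketch is not NLTK's BigramTagger; the single training sentence and the "<s>" start-of-sentence marker are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical training data: one tagged sentence, invented for illustration.
train_sents = [[("the", "AT"), ("dog", "NN"), ("barks", "VBZ")]]

# Count tags per context, where context = (previous tag, current word).
context_counts = defaultdict(Counter)
for sent in train_sents:
    prev_tag = "<s>"                      # marks the start of a sentence
    for word, tag in sent:
        context_counts[(prev_tag, word)][tag] += 1
        prev_tag = tag


def bigram_tag(words):
    tagged, prev_tag = [], "<s>"
    for word in words:
        counts = context_counts.get((prev_tag, word))
        # Unseen context: no tag can be assigned.
        tag = counts.most_common(1)[0][0] if counts else None
        tagged.append((word, tag))
        prev_tag = tag
    return tagged


seen = bigram_tag(["the", "dog", "barks"])
unseen = bigram_tag(["the", "cat", "barks"])
```

In the second sentence 'cat' is new, so it gets None; 'barks' was seen in training, but never after a None tag, so it gets None as well.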
n-gram tagging
* n-gram tagger
· Example: bigram
+ Name: sparse data
+ Reason: specific contexts with no
default tagger
+ Solution: trade-off between accuracy
and coverage
n-gram tagging
* Combining taggers
· Trade-off between accuracy and
coverage
n-gram tagging
* Combining taggers
1. Try tagging with the n-gram tagger
2. If unable, try the (n-1)-gram tagger
3. If unable, try the (n-2)-gram tagger
...
n-gram tagging
* Combining taggers
...
n-2. If unable, try the trigram tagger
n-1. If unable, try the bigram tagger
n. If unable, try the unigram tagger
n+1. If unable, use the default tagger
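The backoff chain can be sketched as a list of taggers tried from the most specific to the most general. The lookup tables below are hypothetical and stand in for trained models; NLTK expresses the same idea through the backoff parameter of its taggers.

```python
# Hypothetical lookup tables, invented for illustration (Brown-style tags).
bigram_table = {("TO", "permit"): "VB", ("AT", "permit"): "NN"}
unigram_table = {"permit": "NN", "the": "AT", "to": "TO"}


def bigram_tagger(word, prev_tag):
    return bigram_table.get((prev_tag, word))   # None if the context is unseen


def unigram_tagger(word, prev_tag):
    return unigram_table.get(word)              # None if the word is unseen


def default_tagger(word, prev_tag):
    return "NN"                                 # always gives an answer


# Most specific tagger first, most general last.
chain = [bigram_tagger, unigram_tagger, default_tagger]


def tag_sentence(words):
    tagged, prev_tag = [], "<s>"
    for word in words:
        # Use the first tagger in the chain that is able to tag the word.
        for tagger in chain:
            tag = tagger(word, prev_tag)
            if tag is not None:
                break
        tagged.append((word, tag))
        prev_tag = tag
    return tagged


result = tag_sentence(["to", "permit", "the", "permit", "holder"])
```

The two occurrences of 'permit' receive different tags because the bigram table takes the previous tag into account; the unknown word 'holder' falls through to the default tagger.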
n-gram tagging
* Combining taggers
· Example:
In [25]: t0 = nltk.DefaultTagger('NN')
In [26]: t1 = nltk.UnigramTagger(train_sents, backoff=t0)
In [27]: t2 = nltk.BigramTagger(train_sents, backoff=t1)
In [28]: t2.evaluate(test_sents)
Out[28]: 0.8447124489185687
n-gram tagging
* Exercise 1
· Build a tagger by combining
a trigram, a bigram, a unigram
and a regular expression tagger (in the
default case)
· Use it to tag a sentence
· Evaluate its performance
n-gram tagging
* Exercise 1 (solution)
import nltk
import re
from nltk.corpus import brown
n-gram tagging
* Exercise 1 (solution)
patterns = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*ould$', 'MD'),
    (r".*'s$", 'NN$'),
    (r'.*s$', 'NNS'),
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),
    (r'.*', 'NN')
]
n-gram tagging
* Exercise 1 (solution)
brown_tagged_sents = brown.tagged_sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
n-gram tagging
* Exercise 1 (solution)
t0 = nltk.RegexpTagger(patterns)
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t3 = nltk.TrigramTagger(train_sents, backoff=t2)
n-gram tagging
* Exercise 1 (solution)
brown_sents = brown.sents(categories='news')
sent = brown_sents[2007]
t3.tag(sent)
t3.evaluate(test_sents)
References
Steven Bird, Ewan Klein, and Edward Loper. “Chapter 5: Categorizing and Tagging Words.” Natural Language Processing
with Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.
Text classification
Supervised classification
* Idea
Supervised classification
* Process
1. Features
2. Encode
3. Feature extractor
Supervised classification
* The process involves important skills:
· Abstraction
· Modelling
· Programming
Supervised classification
* Features
· Abstraction: decide the relevant
information of the data set
* Encode
· Modelling: choose a sound representation
(data structure)
Supervised classification
* Feature extractor
· Programming: program a function that
extracts the features in the chosen
representation
Supervised classification
* Applications:
· Deciding the lexical category of words:
POS tagging
· Deciding the topic of a document from
a list of topics (“sports”, “technology”,
etc.): document classification
Document classification
* Example 1: gender identification
(solved by Naive Bayesian Classifier)
· Evidence
+ Names ending in a, e, i => female
+ Names ending in k, o, r, s, t => male
· Features: last letter
· Encode: dictionary
· Feature extractor: “name => {last letter}”
Document classification
* Example 1: gender identification
· Data
In [1]: from nltk.corpus import names
In [2]: import random
In [3]: all_names = (
   ...:     [(name, 'male') for name in names.words('male.txt')] +
   ...:     [(name, 'female') for name in names.words('female.txt')])
In [4]: random.shuffle(all_names)
Document classification
* Example 1: gender identification
· Feature extractor
In [5]: def gender_features(word):
            return {'last_letter': word[-1]}
# Example
In [6]: gender_features('Shrek')
Out[6]: {'last_letter': 'k'}
Document classification
* Example 1: gender identification
· Classification
In [7]: featuresets = [(gender_features(n), g) for (n, g) in all_names]
In [8]: train_set = featuresets[500:]
In [9]: test_set = featuresets[:500]
In [10]: classifier = nltk.NaiveBayesClassifier.train(train_set)
In [11]: nltk.classify.accuracy(classifier, test_set)
Out[11]: 0.778
Document classification
* Example 2: POS tagging
(solved by Decision Tree Classifier)
· Results: POS tag
· Features: Suffixes
Document classification
* Example 2: POS tagging
· Data
In [1]: from nltk.corpus import brown
In [2]: suffix_fdist = nltk.FreqDist()
In [3]: for word in brown.words():
            word = word.lower()
            suffix_fdist[word[-1:]] += 1
            suffix_fdist[word[-2:]] += 1
            suffix_fdist[word[-3:]] += 1
In [4]: common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]
Document classification
* Example 2: POS tagging
· Feature extractor
In [5]: def pos_features(word):
            features = {}
            for suffix in common_suffixes:
                features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
            return features
Document classification
* Example 2: POS tagging
· Classification
In [6]: tagged_words = brown.tagged_words(categories='news')
In [7]: featuresets = [(pos_features(n), g) for (n, g) in tagged_words]
In [8]: size = int(len(featuresets) * 0.1)
In [9]: train_set, test_set = featuresets[size:], featuresets[:size]
Document classification
* Example 2: POS tagging
· Classification
In [10]: classifier = nltk.DecisionTreeClassifier.train(train_set)
In [11]: classifier.classify(pos_features('cats'))
Out[11]: 'NNS'
In [12]: nltk.classify.accuracy(classifier, test_set)
Out[12]: 0.62705121829935351
Document classification
* Example 3: document classification
(solved by Naive Bayesian Classifier)
· Corpus: Movie Reviews Corpus
· Results: Positive or negative review
· Features: Indicate whether or not the
2000 most frequent words are present in
each review
Document classification
* Example 3: document classification
· Data
In [1]: from nltk.corpus import movie_reviews
In [2]: import random
In [3]: documents = [(list(movie_reviews.words(fileid)), category)
                     for category in movie_reviews.categories()
                     for fileid in movie_reviews.fileids(category)]
In [4]: random.shuffle(documents)
Document classification
* Example 3: document classification
· Feature extractor
In [5]: all_words = nltk.FreqDist(
            w.lower() for w in movie_reviews.words())
In [6]: word_features = [w for (w, count) in all_words.most_common(2000)]
In [7]: def document_features(document):
            document_words = set(document)
            features = {}
            for word in word_features:
                features['contains(%s)' % word] = (word in document_words)
            return features
Document classification
* Example 3: document classification
· Classification
In [7]: featuresets = [(document_features(d), c) for (d, c) in documents]
In [8]: train_set = featuresets[100:]
In [9]: test_set = featuresets[:100]
In [10]: classifier = nltk.NaiveBayesClassifier.train(train_set)
In [11]: nltk.classify.accuracy(classifier, test_set)
Out[11]: 0.84
Document classification
* Example 3: document classification
· 5 most informative features
In [12]: classifier.show_most_informative_features(5)
Most Informative Features
   contains(outstanding) = True    pos : neg = 10.7 : 1.0
         contains(mulan) = True    pos : neg =  9.0 : 1.0
        contains(seagal) = True    neg : pos =  8.2 : 1.0
   contains(wonderfully) = True    pos : neg =  6.4 : 1.0
         contains(damon) = True    pos : neg =  6.4 : 1.0
Document classification
* Exercise 2
· “Reuters-21578 benchmark corpus /
ApteMod version” is a collection of 10,788
documents from the Reuters financial
newswire service
Document classification
* Exercise 2
· Train a naive Bayes classifier with
ApteMod corpus
· Use it to classify a document
· Evaluate its performance
Document classification
* Exercise 2 (solution)
import nltk
import random
from nltk.corpus import reuters
Document classification
* Exercise 2 (solution)
documents = [(list(reuters.words(fileid)), category)
    for category in reuters.categories()
        for fileid in reuters.fileids(category)]
random.shuffle(documents)
Document classification
* Exercise 2 (solution)
all_words = nltk.FreqDist(w.lower() for w in reuters.words())
word_features = [w for (w, count) in all_words.most_common(2000)]
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
Document classification
* Exercise 2 (solution)
featuresets = [(document_features(d), c) for (d, c) in documents]
size = int(len(featuresets) * 0.9)
train_set = featuresets[:size]   # first 90% for training
test_set = featuresets[size:]    # remaining 10% for testing
classifier = nltk.NaiveBayesClassifier.train(train_set)
Document classification
* Exercise 2 (solution)
document = reuters.words('test/14826')
classifier.classify(document_features(document))
nltk.classify.accuracy(classifier, test_set)
References
Steven Bird, Ewan Klein, and Edward Loper. “Chapter 6: Learning to Classify Text.” Natural Language Processing with
Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.
Text information extraction
Information extraction
* Definition:
· Convert unstructured natural-language
data into structured, tabular data
· Get information from the tabulated data
Information extraction
* Architecture:
Entity recognition
* Chunking
· Segments and labels multitoken sequences
· Selects a subset of the tokens (chunks)
· Chunks do not overlap in the source text
Entity recognition
* Chunking
· Entities are mostly nouns
· Let us search for the noun phrase chunks
(NP-chunks)
· Grammar: set of rules that indicate how
sentences should be chunked
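To see how such a rule selects chunks, the tag sequence can be encoded as a string and matched with an ordinary regular expression. This is a rough sketch of the idea using only the re module, not NLTK's RegexpParser; the rule covers an optional determiner or possessive, any number of adjectives, and a noun.

```python
import re

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# Rule: optional determiner or possessive, any number of adjectives, then a noun.
NP_RULE = re.compile(r"(?:<DT>|<PP\$>)?(?:<JJ>)*<NN>")


def np_chunks(tagged):
    # Encode the tag sequence as one string and remember where each tag starts.
    tag_string, starts = "", []
    for _, tag in tagged:
        starts.append(len(tag_string))
        tag_string += "<%s>" % tag
    chunks = []
    for match in NP_RULE.finditer(tag_string):
        first = starts.index(match.start())
        last = max(i for i, s in enumerate(starts) if s < match.end())
        chunks.append([word for word, _ in tagged[first:last + 1]])
    return chunks


chunks = np_chunks(sentence)
```

Matches found by finditer never overlap, which mirrors the property that chunks do not overlap in the source text.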
Entity recognition
* NP-chunker
In [1]: import nltk, re, pprint
In [2]: grammar = r"""
# chunk optional determiner/possessive, adjectives and nouns
NP: {<DT|PP\$>?<JJ>*<NN>} 
# chunk sequences of proper nouns
{<NNP>+}
"""
In [3]: cp = nltk.RegexpParser(grammar)
Entity recognition
* NP-chunker
In [4]: sentence1 = [("the", "DT"), ("little", "JJ"), 
("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at", 
"IN"), ("the", "DT"), ("cat", "NN")]
In [5]: sentence2 = [("Rapunzel", "NNP"), ("let", "VBD"), 
("down", "RP"), ("her", "PP$"), ("long", "JJ"), ("golden", 
"JJ"), ("hair", "NN")]
In [6]: result1 = cp.parse(sentence1)
In [7]: result2 = cp.parse(sentence2)
Entity recognition
* NP-chunker
In [8]: print(result1)
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
In [9]: print(result2)
(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))
Entity recognition
* NP-chunker
In [10]: result1.draw()
Entity recognition
* Chunking text corpora
In [11]: from nltk.corpus import brown
    ...: nps = []
    ...: for sent in brown.tagged_sents():
    ...:     tree = cp.parse(sent)
    ...:     for subtree in tree.subtrees():
    ...:         if subtree.label() == 'NP':
    ...:             nps.append(subtree)
In [12]: for np in nps[:10]:
    ...:     print(np)
(NP investigation/NN)
(NP widespread/JJ interest/NN)
(NP this/DT city/NN)
(NP new/JJ multi-million-dollar/JJ airport/NN)
(NP his/PP$ wife/NN)
(NP His/PP$ political/JJ career/NN)
...
Entity recognition
* Named entities
· Are definite noun phrases
· Refer to specific types of individuals:
Entity recognition
* Named entity recognition
· Task well suited to classifier-based
approach for noun phrase chunking
Entity recognition
* Named entity recognition
· Example:
In [1]: sent = nltk.corpus.treebank.tagged_sents()[22]
In [2]: print(nltk.ne_chunk(sent))
(S
  The/DT
  (GPE U.S./NNP)
  is/VBZ
  one/CD
  ...
  according/VBG
  to/TO
  (PERSON Brooke/NNP T./NNP Mossman/NNP)
  ...)
Relation extraction
* Extraction of relations that exist between
the recognized named entities
* Approach: initially look for all triples of
the form (X, α, Y)
· X and Y are named entities of specific
types
· α is the relation
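The triple-based approach can be sketched in plain Python. The candidate triples below are hypothetical and assume that the entities were already recognized; the filter keeps the triples whose entity types match and whose middle part matches a pattern for the word "in".

```python
import re

# Hypothetical candidates: (X, alpha, Y) where X and Y are (type, name) pairs
# and alpha is the text found between them. Invented for illustration.
candidates = [
    (("ORG", "WHYY"), "in", ("LOC", "Philadelphia")),
    (("ORG", "Open Text"), ", based in", ("LOC", "Waterloo")),
    (("ORG", "House"), "success in supervising the transition of", ("LOC", "Iraq")),
    (("PER", "Ann"), "in", ("LOC", "Paris")),
]

# The word "in", not followed later by a word ending in "ing" (a gerund).
IN = re.compile(r".*\bin\b(?!\b.+ing)")


def extract(triples, subj_type, obj_type, pattern):
    return [(x[1], alpha, y[1])
            for x, alpha, y in triples
            if x[0] == subj_type and y[0] == obj_type and pattern.match(alpha)]


relations = extract(candidates, "ORG", "LOC", IN)
```

The third candidate is rejected by the negative lookahead and the fourth by the entity types.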
Relation extraction
* Example:
In [1]: import nltk
In [2]: import re
In [3]: IN = re.compile(r'.*\bin\b(?!\b.+ing)')
In [4]: for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc,
                                     corpus='ieer', pattern=IN):
        # rtuple is the NLTK 3 name of show_raw_rtuple
        print(nltk.sem.rtuple(rel))
Relation extraction
* Example:
[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 
'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' 
[LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']
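The pattern from the example can be checked in isolation with the standard re module. In this small sketch the filler strings are made up for illustration, except for those copied from the output above; re.match is used on the assumption that the pattern is anchored at the start of the filler.

```python
import re

# Same pattern as in the example: any text, then the whole word "in".
# The negative lookahead rejects an "in" that is followed, anywhere
# later in the filler, by "ing" (for example a gerund).
IN = re.compile(r'.*\bin\b(?!\b.+ing)')

fillers = [
    "in",                         # from the output above
    ", based in",                 # from the output above
    "firm in",                    # from the output above
    "inside",                     # made up: "in" is not a whole word
    "success in supervising the", # made up: "in" followed by a gerund
]

matches = {filler: bool(IN.match(filler)) for filler in fillers}
for filler, ok in matches.items():
    print(repr(filler), ok)
```

The word boundaries (\b) are what keep "inside" from matching, and the lookahead is what filters out fillers where "in" introduces a verb phrase rather than a location.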
Relation extraction
* Exercise 3
· From the corpus ieer, extract
all the relations of type “people were
born in a location”
Relation extraction
* Exercise 3 (solution)
import nltk
import re

# whole word "born" somewhere between the person and the location
BORN = re.compile(r'.*\bborn\b')

# fileids() lists the corpus documents (the README is not among them),
# so no hard-coded path to nltk_data is needed
for f in nltk.corpus.ieer.fileids():
    for doc in nltk.corpus.ieer.parsed_docs(f):
        for rel in nltk.sem.extract_rels('PER', 'LOC', doc,
                                         corpus='ieer', pattern=BORN):
            # in NLTK 3 this helper is called rtuple
            print(nltk.sem.relextract.show_raw_rtuple(rel))
References
Steven Bird, Ewan Klein, and Edward Loper. “Chapter 7: Extracting Information from Text.” Natural Language Processing
with Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.
Assignment
* Assignment 9
· Readings
+ Supervised classification (Natural
Language Processing with Python)
+ Decision Tree Learning (Machine
Learning)
References
Mitchell, Tom M. “Chapter 3: Decision Tree Learning.” Machine Learning. New York: McGraw-Hill, 1997. Print.
Steven Bird, Ewan Klein, and Edward Loper. “Chapter 6: Learning to Classify Text - Supervised Classification.” Natural
Language Processing with Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.
Bibliography
“Frequency Distribution.” Wikipedia, the free encyclopedia 7 Apr. 2014. Wikipedia. Web. 8 Apr. 2014.
Mitchell, Tom M. Machine Learning. New York: McGraw-Hill, 1997. Print.
“Part of Speech.” Wikipedia, the free encyclopedia 5 Apr. 2014. Wikipedia. Web. 8 Apr. 2014.
Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly Media, 2009. 504.
shop.oreilly.com. Web. 8 Mar. 2014.

Lecture09

  • 1. Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard Department of Modern Languages and Literatures Western University
  • 2. Lecture 9 Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard * Contents: 1. Why this lecture? 2. Discussion 3. Chapter 9 4. Assignment 5. Bibliography 2
  • 3. Why this lecture? Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard * This lecture... · teaches some NLP techniques subject to be applied to real problems · presents another example of how DH put together various disciplines (Linguistics, Artificial Intelligence, Information Science, Statistics...) to solve problems 3
  • 4. Last assignment discussion Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard * Time to... · consolidate ideas and concepts dealt in the readings 4
  • 5. Chapter 9 Natural Language Processing in Python 1. Preliminary theory 2. Word tagging and categorization 3. Text classification 4. Text information extraction Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard5
  • 6. Chapter 9 1 Preliminary theory 1.1 Linguistics 1.2 Statistics 1.3 Artificial Intelligence Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard6
  • 7. Chapter 9 2 Word tagging and categorization 2.1 Tagger 2.2 Automatic tagging 2.3 n-gram tagging Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard7
  • 8. Chapter 9 3 Text classification 3.1 Supervised classification 3.2 Document classification Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard8
  • 9. Chapter 9 4 Text information extraction 4.1 Information extraction 4.2 Entity recognition 4.3 Relation extraction Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard9
  • 10. Preliminary theory Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard10
  • 11. Linguistics * Lexical categories · nouns: people, places, things, concepts · verbs: actions · adjectives: describes nouns · adverbs: modifies adjectives and verbs · ... Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard11
  • 12. Linguistics * Lexical categories Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard12
  • 13. Linguistics * These word classes are also known as part-of-speech * They arise from simple analysis of the distribution of words in text Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard13
  • 14. Statistics * Frequency distribution · Arrangement of the values that one or more variables take in a sample Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard14
  • 15. Statistics * Frequency distribution · Example: vocabulary in a text + how many times each word appears in the text? + it is a “distribution” since it tells us how the total number of word tokens in the text are distributed across the vocabulary items Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard15
  • 16. Statistics * Frequency distribution · Example: vocabulary in a text Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard16
  • 17. Statistics * Conditional frequency distribution · A collection of frequency distributions, each one for a different condition Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard17
  • 18. Statistics * Conditional frequency distribution · Example: vocabulary in a text + when the texts of a corpus are divided into several categories we can maintain separate frequency distributions for each category + the condition will often be the category of the text Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard18
  • 19. Statistics * Conditional frequency distribution · Example: vocabulary in a text Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard19
  • 20. Artificial Intelligence * Supervised vs unsupervised learning · Supervised learning: + Possible results are known + Data is labeled · Unsupervised learning: + Results are unknown + Data is clustered Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard20
  • 21. Artificial Intelligence * Decision trees · Flowchart that selects labels for input values · Formed by decision and leaf nodes · Decision nodes: check feature values · Leaf nodes: assign labels Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard21
  • 22. Artificial Intelligence * Decision trees · Example: “Going out?” Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard22
  • 23. Artificial Intelligence * Naive Bayes classifiers 1. Begins by calculating the prior probability of each label, determined by checking the frequency of each label in the training set Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard23
  • 24. Artificial Intelligence * Naive Bayes classifiers 2. The contribution from each feature is combined with this prior probability, to arrive at a likelihood estimate for each label 3. The label whose likelihood estimate is the highest is then assigned to the input value Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard24
  • 25. Artificial Intelligence * Naive Bayes classifiers · Example: document classification Prior probability: close “Automotive” Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard25
  • 26. References “Frequency Distribution.” Wikipedia, the free encyclopedia 7 Apr. 2014. Wikipedia. Web. 8 Apr. 2014. Mitchell, Tom M. “Chapter 3: Decision Tree Learning.” Machine Learning. New York: McGraw-Hill, 1997. Print. Mitchell, Tom M. “Chapter 6: Bayesian Learning.” Machine Learning. New York: McGraw-Hill, 1997. Print. “Part of Speech.” Wikipedia, the free encyclopedia 5 Apr. 2014. Wikipedia. Web. 8 Apr. 2014. Steven Bird, Ewan Klein, and Edward Loper. “Conditional Frequency Distributions.” Natural Language Processing with Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014. Steven Bird, Ewan Klein, and Edward Loper. “Frequency Distributions.” Natural Language Processing with Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014. Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard26
  • 27. Word tagging and classification Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard27
  • 28. Tagger * Processes a sequence of words, and attaches a part of speech tag to each word * Procedure: 1. Tokenization 2. Tagging Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard28
  • 29. Tagger * Example 1: Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard29 In [1]: text = 'And now for something completely different' In [2]: tokens = nltk.word_tokenize(text) In [3]: nltk.pos_tag(tokens) Out[3]:  [('And', 'CC'),  ('now', 'RB'),  ('for', 'IN'),  ('something', 'NN'),  ('completely', 'RB'),  ('different', 'JJ')]
  • 30. Tagger * Example 2: Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard30 In [1]: text = 'They refuse to permit us to obtain the  refuse permit' In [2]: tokens = nltk.word_tokenize(text) In [3]: nltk.pos_tag(tokens) Out[3]:  [('They', 'PRP'),  ('refuse', 'VBP'),  ('to', 'TO'),  ('permit', 'VB'),  ('us', 'PRP'),  ('to', 'TO'),  ('obtain', 'VB'),  ('the', 'DT'),  ('refuse', 'NN'),  ('permit', 'NN')]
  • 31. Automatic tagging * The tag of a word depends on the word itself and its context within a sentence * Working with data at the level of tagged sentences rather than tagged words Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard31
  • 32. Automatic tagging * Loading data · Example: tagged and non-tagged sentences of “news” category Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard32 In [1]: from nltk.corpus import brown In [2]: brown_tagged_sents =                               brown.tagged_sents(categories='news') In [3]: brown_sents = brown.sents(categories='news')
  • 33. Automatic tagging * Default tagger · Chose the most likely tag Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard33 In [4]: tags = [tag for (word, tag) in  brown.tagged_words(categories='news')] In [4]: nltk.FreqDist(tags).max() Out[4]: 'NN'
  • 34. Automatic tagging * Default tagger · Assign the most likely tag to each token Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard34 In [5]: text = 'I do not like green eggs and ham, I do not  like them Sam I am!' In [6]: tokens = nltk.word_tokenize(text) In [7]: default_tagger = nltk.DefaultTagger('NN')
  • 35. Automatic tagging * Default tagger Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard35 In [8]: default_tagger.tag(tokens) Out[8]:  [('I', 'NN'),  ('do', 'NN'),  ('not', 'NN'),  ('like', 'NN'),  ('green', 'NN'),  ('eggs', 'NN'),  ('and', 'NN'),  ('ham', 'NN'),  (',', 'NN'),
  • 36. Automatic tagging * Default tagger Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard36 ...  ('I', 'NN'),  ('do', 'NN'),  ('not', 'NN'),  ('like', 'NN'),  ('them', 'NN'),  ('Sam', 'NN'),  ('I', 'NN'),  ('am', 'NN'),  ('!', 'NN')]
  • 37. Automatic tagging * Default tagger · This method performs rather poorly · Unknown words will be nouns (as it happens, most new words are nouns) Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard37 In [9]: default_tagger.evaluate(brown_tagged_sents) Out[9]: 0.13089484257215028
  • 38. Automatic tagging * Regular expression tagger · Assigns tags to tokens on the basis of matching patterns Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard38 In [10]: patterns = [    ...: (r'.*ing$', 'VBG'),              # gerounds    ...: (r'.*ed$', 'VBD'),               # simple past    ...: (r'.*es$', 'VBZ'),               # 3rd sing present    ...: (r'.*ould$', 'MD'),              # modals    ...: (r'.*'s$', 'NN$'),              # possessive nouns    ...: (r'.*s$', 'NNS'),                # plural nouns    ...: (r'^­?[0­9]+(.[0­9]+)?$', 'CD'), # cardinal numbers    ...: (r'.*', 'NN'),                   # nouns (default)    ...: ] In [11]: regexp_tagger = nltk.RegexpTagger(patterns)
  • 39. Automatic tagging * Regular expression tagger Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard39 In [12]: regexp_tagger.tag(brown_sents[3]) Out[12]:  [('``', 'NN'),  ('Only', 'NN'),  ('a', 'NN'),  ('relative', 'NN'),  ('handful', 'NN'),  ('of', 'NN'),  ('such', 'NN'),  ('reports', 'NNS'),  ('was', 'NNS'),  ('received', 'VBD'),  ...]
  • 40. Automatic tagging * Regular expression tagger · This method is correct about a fifth of the time · The final regular expression «.*» is a catch-all that tags everything as a noun Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard40 In [13]: regexp_tagger.evaluate(brown_tagged_sents) Out[13]: 0.20326391789486245
  • 41. Automatic tagging * Lookup tagger · Problem: a lot of high-frequency words do not have the NN tag Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard41
  • 42. Automatic tagging * Lookup tagger · Solution: + Find the hundred most frequent words and store their most likely tag + Use this information as model for a lookup tagger (NLTK UnigramTagger) + Tag everything else as a noun Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard42
  • 43. Automatic tagging * Lookup tagger Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard43 In [14]: fd = nltk.FreqDist(brown.words(categories='news')) In [15]: cfd = #counts how many times a word belongs to a category  nltk.ConditionalFreqDist(brown.tagged_words(categories='news')) In [16]: most_freq_words = fd.keys()[:100] In [17]: likely_tags = dict((word, cfd[word].max()) for word in  most_freq_words) #from all categories of a word, take the maximum In [18]: baseline_tagger = nltk.UnigramTagger(model=likely_tags,  backoff=nltk.DefaultTagger('NN')) In [19]: baseline_tagger.evaluate(brown_tagged_sents) Out[19]: 0.5817769556656125
  • 44. Automatic tagging * Lookup tagger · The tagger accuracy increases as the model size grows Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard44
  • 45. n-gram tagging * Unigram tagger · As the lookup tagger, assign the most likely tag to each token · As opposed to the default tagger, it is trained for setting it up · Training: initialize the tagger with a tagged sentence data as a parameter Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard45
  • 46. n-gram tagging * Unigram tagger · Separate the data in: + Training data (90%) + Testing data (10%) Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard46
  • 47. n-gram tagging * Unigram tagger Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard47 In [20]: size = int(len(brown_tagged_sents) * 0.9) In [21]: train_sents = brown_tagged_sents[:size] In [22]: test_sents = brown_tagged_sents[size:] In [23]: unigram_tagger = nltk.UnigramTagger(train_sents)
  • 48. n-gram tagging * Unigram tagger Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard48 In [24]: unigram_tagger.tag(brown_sents[2007]) Out[24]:  [('Various', 'JJ'),  ('of', 'IN'),  ('the', 'AT'),  ('apartments', 'NNS'),  ('are', 'BER'),  ('of', 'IN'),  ('the', 'AT'),  ('terrace', 'NN'),  ('type', 'NN'),  (',', ','), ...
  • 49. n-gram tagging * Unigram tagger Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard49  ('being', 'BEG'),  ('on', 'IN'),  ('the', 'AT'),  ('ground', 'NN'),  ('floor', 'NN'),  ('so', 'QL'),  ('that', 'CS'),  ('entrance', 'NN'),  ('is', 'BEZ'),  ('direct', 'JJ'),  ('.', '.')] In [21]: unigram_tagger.evaluate(test_sents) Out[21]: 0.8110236220472441
  • 50. n-gram tagging
    * An n-gram tagger picks the tag that is most likely in the given context
    * Unigram (1-gram) tagger
      · Context:
        + current token in isolation
  • 51. n-gram tagging
    * Bigram (2-gram) tagger
      · Context:
        + current token
        + POS tag of the preceding token
    * Trigram (3-gram) tagger
      · Context:
        + current token
        + POS tags of the 2 preceding tokens
  • 52. n-gram tagging
    * n-gram tagger
      · Context:
        + current token
        + POS tags of the n-1 preceding tokens
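To make the notion of context concrete, here is a small stdlib-only sketch that lists the context an n-gram tagger would look at for every token of a tagged sentence. The function name `ngram_contexts` and the representation of a context as a `(previous tags, current word)` pair are assumptions for this illustration, not NLTK's internal data structure.

```python
def ngram_contexts(tagged_sent, n):
    """Yield ((previous n-1 tags, current word), tag) for each token."""
    tags = [tag for _, tag in tagged_sent]
    for i, (word, tag) in enumerate(tagged_sent):
        # At the start of a sentence fewer than n-1 tags are available.
        history = tuple(tags[max(0, i - n + 1):i])
        yield (history, word), tag

sent = [("the", "AT"), ("ground", "NN"), ("floor", "NN")]
for n in (1, 2, 3):
    print(n, list(ngram_contexts(sent, n)))
```

With n = 1 the history is always empty, so the context reduces to the current token in isolation, as stated for the unigram tagger.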
  • 53. n-gram tagging
    * n-gram tagger
      · Example: bigram
    In [22]: bigram_tagger = nltk.BigramTagger(train_sents)
    In [23]: bigram_tagger.evaluate(train_sents)
    Out[23]: 0.7853094861965731
    In [24]: bigram_tagger.evaluate(test_sents)
    Out[24]: 0.10216286255357321
  • 54. n-gram tagging
    * n-gram tagger
      · Example: bigram
        + Problem: it tags the words of sentences seen during training, but
          - it cannot tag a word it has never seen (it assigns None)
  • 55. n-gram tagging
    * n-gram tagger
      · Example: bigram
        + Problem: it tags the words of sentences seen during training, but
          - it also fails on the word that follows an unseen word, even if that word is known, because during training it never saw that word preceded by a None tag
  • 56. n-gram tagging
    * n-gram tagger
      · Example: bigram
        + Name: the sparse data problem
        + Reason: the contexts are very specific and there is no fallback tagger
        + Solution: a trade-off between accuracy and coverage
  • 57. n-gram tagging
    * Combining taggers
      · Trade-off between accuracy and coverage
  • 58. n-gram tagging
    * Combining taggers
      1. Try tagging with the n-gram tagger
      2. If unable, try the (n-1)-gram tagger
      3. If unable, try the (n-2)-gram tagger
      ...
  • 59. n-gram tagging
    * Combining taggers
      ...
      n-2. If unable, try the trigram tagger
      n-1. If unable, try the bigram tagger
      n. If unable, try the unigram tagger
      n+1. If unable, use the default tagger
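The backoff chain above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not NLTK's code: the classes `DefaultTagger` and `NgramTagger` below are hypothetical stand-ins written for this example, and the two training sentences are made up. The sketch also reproduces the sparse data problem: a bigram tagger without backoff returns None for an unseen word and then for the word after it.

```python
from collections import Counter, defaultdict

class DefaultTagger:
    """Last resort: always answers with the same tag."""
    def __init__(self, tag):
        self.default_tag = tag

    def tag_one(self, word, history):
        return self.default_tag

class NgramTagger:
    """Picks the most frequent tag for (previous n-1 tags, current word)."""
    def __init__(self, n, train_sents, backoff=None):
        self.n = n
        self.backoff = backoff
        counts = defaultdict(Counter)
        for sent in train_sents:
            gold_tags = [tag for _, tag in sent]
            for i, (word, tag) in enumerate(sent):
                counts[self._context(word, gold_tags[:i])][tag] += 1
        self.table = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

    def _context(self, word, history):
        start = max(0, len(history) - (self.n - 1))
        return (tuple(history[start:]), word)

    def tag_one(self, word, history):
        tag = self.table.get(self._context(word, history))
        if tag is None and self.backoff is not None:
            # Unknown context: hand the decision to the less specific tagger.
            tag = self.backoff.tag_one(word, history)
        return tag

    def tag(self, words):
        tags = []
        for word in words:
            tags.append(self.tag_one(word, tags))
        return list(zip(words, tags))

# Hypothetical training data.
train = [
    [("the", "AT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ")],
]

bigram_only = NgramTagger(2, train)
t0 = DefaultTagger("NN")
t1 = NgramTagger(1, train, backoff=t0)
t2 = NgramTagger(2, train, backoff=t1)

print(bigram_only.tag(["the", "bird", "sleeps"]))
print(t2.tag(["the", "bird", "sleeps"]))
```

With the chain in place, the unseen word "bird" falls through to the default tag, and that tag gives the bigram tagger a usable context for the next word.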
  • 60. n-gram tagging
    * Combining taggers
      · Example:
    In [25]: t0 = nltk.DefaultTagger('NN')
    In [26]: t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    In [27]: t2 = nltk.BigramTagger(train_sents, backoff=t1)
    In [28]: t2.evaluate(test_sents)
    Out[28]: 0.8447124489185687
  • 61. n-gram tagging
    * Exercise 1
      · Build a tagger by combining a trigram, a bigram, a unigram and a regular expression tagger (as the default case)
      · Use it to tag a sentence
      · Evaluate its performance
  • 62. n-gram tagging
    * Exercise 1 (solution)
    import nltk
    import re
    from nltk.corpus import brown
  • 63. n-gram tagging
    * Exercise 1 (solution)
    patterns = [
        (r'.*ing$', 'VBG'),
        (r'.*ed$', 'VBD'),
        (r'.*es$', 'VBZ'),
        (r'.*ould$', 'MD'),
        (r".*'s$", 'NN$'),
        (r'.*s$', 'NNS'),
        (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),
        (r'.*', 'NN')
    ]
  • 64. n-gram tagging
    * Exercise 1 (solution)
    brown_tagged_sents = brown.tagged_sents(categories='news')
    size = int(len(brown_tagged_sents) * 0.9)
    train_sents = brown_tagged_sents[:size]
    test_sents = brown_tagged_sents[size:]
  • 65. n-gram tagging
    * Exercise 1 (solution)
    t0 = nltk.RegexpTagger(patterns)
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)
    # the trigram tagger backs off to the bigram tagger
    t3 = nltk.TrigramTagger(train_sents, backoff=t2)
  • 66. n-gram tagging
    * Exercise 1 (solution)
    brown_sents = brown.sents(categories='news')
    sent = brown_sents[2007]
    t3.tag(sent)
    # evaluate on the held-out 10%, not on data seen during training
    t3.evaluate(test_sents)
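The regular expression tagger at the bottom of the Exercise 1 chain tries its patterns in order and assigns the tag of the first one that matches, which is why the catch-all pattern comes last. A minimal sketch of that behaviour using only the `re` module follows; the function name `regexp_tag` is an assumption for this example, and the code illustrates the idea rather than NLTK's `RegexpTagger` implementation.

```python
import re

PATTERNS = [
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past
    (r'.*es$', 'VBZ'),                 # 3rd person singular present
    (r'.*ould$', 'MD'),                # modals
    (r".*'s$", 'NN$'),                 # possessive nouns
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN'),                     # everything else: noun
]
COMPILED = [(re.compile(pattern), tag) for pattern, tag in PATTERNS]

def regexp_tag(words):
    """Tag each word with the tag of the first pattern that matches it."""
    tagged = []
    for word in words:
        for regex, tag in COMPILED:
            if regex.match(word):
                tagged.append((word, tag))
                break
    return tagged

print(regexp_tag(["running", "walked", "goes", "could",
                  "John's", "cats", "-3.5", "dog"]))
```

Because the order of the patterns decides the outcome, moving `.*s$` above `.*es$` would tag "goes" as a plural noun instead of a verb.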
  • 67. References
    Steven Bird, Ewan Klein, and Edward Loper. “Chapter 5: Categorizing and Tagging Words.” Natural Language Processing with Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.
  • 68. Text classification
  • 69. Supervised classification
    * Idea
  • 70. Supervised classification
    * Process
      1. Features
      2. Encode
      3. Feature extractor
  • 71. Supervised classification
    * The process involves important skills:
      · Abstraction
      · Modelling
      · Programming
  • 72. Supervised classification
    * Features
      · Abstraction: decide which information in the data set is relevant
    * Encode
      · Modelling: choose a sound representation (data structure)
  • 73. Supervised classification
    * Feature extractor
      · Programming: write a function that extracts the features in the chosen representation
  • 74. Supervised classification
    * Applications:
      · Deciding the lexical category of words: POS tagging
      · Deciding the topic of a document from a list of topics (“sports”, “technology”, etc.): document classification
  • 75. Document classification
    * Example 1: gender identification (solved with a naive Bayes classifier)
      · Evidence
        + Names ending in a, e, i => female
        + Names ending in k, o, r, s, t => male
      · Features: last letter
      · Encode: dictionary
      · Feature extractor: “name => {last letter}”
  • 76. Document classification
    * Example 1: gender identification
      · Data
    In [1]: from nltk.corpus import names
    In [2]: import nltk, random
    In [3]: all_names = (
                [(name, 'male') for name in names.words('male.txt')] +
                [(name, 'female') for name in names.words('female.txt')])
    In [4]: random.shuffle(all_names)
  • 77. Document classification
    * Example 1: gender identification
      · Feature extractor
    In [5]: def gender_features(word):
                return {'last_letter': word[-1]}
    # Example
    In [6]: gender_features('Shrek')
    Out[6]: {'last_letter': 'k'}
  • 78. Document classification
    * Example 1: gender identification
      · Classification
    In [7]: featuresets = [(gender_features(n), g) for (n, g) in all_names]
    In [8]: train_set = featuresets[500:]
    In [9]: test_set = featuresets[:500]
    In [10]: classifier = nltk.NaiveBayesClassifier.train(train_set)
    In [11]: nltk.classify.accuracy(classifier, test_set)
    Out[11]: 0.778
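What a naive Bayes classifier does with the last-letter feature can be sketched with the standard library alone. The block below is an illustration under stated assumptions, not NLTK's `NaiveBayesClassifier`: the helper names `train_naive_bayes` and `classify`, the add-one smoothing, and the eight hand-picked names are all choices made for this example.

```python
import math
from collections import Counter, defaultdict

def gender_features(word):
    return {'last_letter': word[-1]}

def train_naive_bayes(labeled_featuresets, smoothing=1.0):
    """Count labels and (label, feature, value) co-occurrences."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature name) -> value counts
    seen_values = defaultdict(set)       # feature name -> set of observed values
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for fname, fval in features.items():
            value_counts[(label, fname)][fval] += 1
            seen_values[fname].add(fval)
    return label_counts, value_counts, seen_values, smoothing

def classify(model, features):
    """Return the label maximizing log P(label) + sum of log P(value | label)."""
    label_counts, value_counts, seen_values, k = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        for fname, fval in features.items():
            numerator = value_counts[(label, fname)][fval] + k
            denominator = count + k * len(seen_values[fname])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training names.
labeled_names = [
    ("Anna", "female"), ("Maria", "female"), ("Julia", "female"), ("Sophie", "female"),
    ("Mark", "male"), ("Peter", "male"), ("Carlos", "male"), ("Robert", "male"),
]
model = train_naive_bayes([(gender_features(n), g) for n, g in labeled_names])

print(classify(model, gender_features("Laura")))
print(classify(model, gender_features("Frank")))
```

The classifier only ever sees the dictionary produced by the feature extractor, so changing the extractor (for example, adding the first letter) changes what the model can learn without touching the training code.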
  • 79. Document classification
    * Example 2: POS tagging (solved with a decision tree classifier)
      · Results: POS tag
      · Features: suffixes
  • 80. Document classification
    * Example 2: POS tagging
      · Data
    In [1]: import nltk
            from nltk.corpus import brown
    In [2]: suffix_fdist = nltk.FreqDist()
    In [3]: for word in brown.words():
                word = word.lower()
                suffix_fdist.inc(word[-1:])
                suffix_fdist.inc(word[-2:])
                suffix_fdist.inc(word[-3:])
    In [4]: common_suffixes = suffix_fdist.keys()[:100]
  • 81. Document classification
    * Example 2: POS tagging
      · Feature extractor
    In [5]: def pos_features(word):
                features = {}
                for suffix in common_suffixes:
                    features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
                return features
  • 82. Document classification
    * Example 2: POS tagging
      · Classification
    In [6]: tagged_words = brown.tagged_words(categories='news')
    In [7]: featuresets = [(pos_features(n), g) for (n, g) in tagged_words]
    In [8]: size = int(len(featuresets) * 0.1)
    In [9]: train_set, test_set = featuresets[size:], featuresets[:size]
  • 83. Document classification
    * Example 2: POS tagging
      · Classification
    In [10]: classifier = nltk.DecisionTreeClassifier.train(train_set)
    In [11]: classifier.classify(pos_features('cats'))
    Out[11]: 'NNS'
    In [12]: nltk.classify.accuracy(classifier, test_set)
    Out[12]: 0.62705121829935351
  • 84. Document classification
    * Example 3: document classification (solved with a naive Bayes classifier)
      · Corpus: Movie Reviews Corpus
      · Results: positive or negative review
      · Features: for each of the 2000 most frequent words, whether or not it is present in the review
  • 85. Document classification
    * Example 3: document classification
      · Data
    In [1]: from nltk.corpus import movie_reviews
    In [2]: import nltk, random
    In [3]: documents = [(list(movie_reviews.words(fileid)), category)
                         for category in movie_reviews.categories()
                         for fileid in movie_reviews.fileids(category)]
    In [4]: random.shuffle(documents)
  • 86. Document classification
    * Example 3: document classification
      · Feature extractor
    In [5]: all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
    In [6]: word_features = all_words.keys()[:2000]
    In [7]: def document_features(document):
                document_words = set(document)
                features = {}
                for word in word_features:
                    features['contains(%s)' % word] = (word in document_words)
                return features
  • 87. Document classification
    * Example 3: document classification
      · Classification
    In [7]: featuresets = [(document_features(d), c) for (d, c) in documents]
    In [8]: train_set = featuresets[100:]
    In [9]: test_set = featuresets[:100]
    In [10]: classifier = nltk.NaiveBayesClassifier.train(train_set)
    In [11]: nltk.classify.accuracy(classifier, test_set)
    Out[11]: 0.84
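The word-presence features of Example 3 can be reproduced on a toy scale with `collections.Counter`. This is a sketch under stated assumptions: the three mini "reviews" are invented, and `top_words` is a hypothetical helper that plays the role of taking the most frequent words from the frequency distribution. In NLTK 3, `FreqDist` is a `Counter` subclass whose `keys()` are not sorted by frequency, so `most_common(n)` is the usual replacement for `keys()[:n]`.

```python
from collections import Counter

# Invented mini-corpus of (tokens, label) pairs.
documents = [
    (["a", "wonderful", "and", "outstanding", "film"], "pos"),
    (["a", "dull", "and", "boring", "film"], "neg"),
    (["outstanding", "cast", "and", "a", "wonderful", "plot"], "pos"),
]

def top_words(docs, n):
    """Return the n most frequent lowercased words over all documents."""
    counts = Counter(w.lower() for words, _ in docs for w in words)
    return [word for word, _ in counts.most_common(n)]

def document_features(document, word_features):
    """One boolean feature per vocabulary word: is it present in the document?"""
    document_words = set(document)
    return {'contains(%s)' % word: (word in document_words)
            for word in word_features}

word_features = top_words(documents, 5)
featuresets = [(document_features(d, word_features), c) for d, c in documents]
print(featuresets[1])
```

Converting the document to a set first keeps each membership test constant-time, which matters once the vocabulary has thousands of entries.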
  • 88. Document classification
    * Example 3: document classification
      · 5 most informative features
    In [12]: classifier.show_most_informative_features(5)
    Most Informative Features
       contains(outstanding) = True    pos : neg = 10.7 : 1.0
             contains(mulan) = True    pos : neg =  9.0 : 1.0
            contains(seagal) = True    neg : pos =  8.2 : 1.0
       contains(wonderfully) = True    pos : neg =  6.4 : 1.0
             contains(damon) = True    pos : neg =  6.4 : 1.0
  • 89. Document classification
    * Exercise 2
      · “Reuters-21578 benchmark corpus / ApteMod version” is a collection of 10,788 documents from the Reuters financial newswire service
  • 90. Document classification
    * Exercise 2
      · Train a naive Bayes classifier with the ApteMod corpus
      · Use it to classify a document
      · Evaluate its performance
  • 91. Document classification
    * Exercise 2 (solution)
    import nltk
    import random
    from nltk.corpus import reuters
  • 92. Document classification
    * Exercise 2 (solution)
    documents = [(list(reuters.words(fileid)), category)
                 for category in reuters.categories()
                 for fileid in reuters.fileids(category)]
    random.shuffle(documents)
  • 93. Document classification
    * Exercise 2 (solution)
    all_words = nltk.FreqDist(w.lower() for w in reuters.words())
    word_features = all_words.keys()[:2000]
    def document_features(document):
        document_words = set(document)
        features = {}
        for word in word_features:
            features['contains(%s)' % word] = (word in document_words)
        return features
  • 94. Document classification
    * Exercise 2 (solution)
    featuresets = [(document_features(d), c) for (d, c) in documents]
    size = int(len(featuresets) * 0.9)
    # train on the first 90%, test on the remaining 10%
    train_set = featuresets[:size]
    test_set = featuresets[size:]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
  • 95. Document classification
    * Exercise 2 (solution)
    document = reuters.words('test/14826')
    classifier.classify(document_features(document))
    nltk.classify.accuracy(classifier, test_set)
  • 96. References
    Steven Bird, Ewan Klein, and Edward Loper. “Chapter 6: Learning to Classify Text.” Natural Language Processing with Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.
  • 97. Text information extraction
  • 98. Information extraction
    * Definition:
      · Convert the unstructured data of natural language into structured (tabular) data
      · Retrieve information from the tabulated data
  • 99. Information extraction
    * Architecture:
  • 100. Entity recognition
    * Chunking
      · Segments and labels multi-token sequences
      · Selects a subset of the tokens (chunks)
      · Chunks do not overlap in the source text
  • 101. Entity recognition
    * Chunking
      · Entities are mostly nouns
      · Let us search for the noun phrase chunks (NP-chunks)
      · Grammar: a set of rules that indicate how sentences should be chunked
  • 102. Entity recognition
    * NP-chunker
    In [1]: import nltk, re, pprint
    In [2]: grammar = r"""
              NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk optional determiner/possessive, adjectives and noun
                  {<NNP>+}                # chunk sequences of proper nouns
            """
    In [3]: cp = nltk.RegexpParser(grammar)
  • 103. Entity recognition
    * NP-chunker
    In [4]: sentence1 = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
                         ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
                         ("the", "DT"), ("cat", "NN")]
    In [5]: sentence2 = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
                         ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"),
                         ("hair", "NN")]
    In [6]: result1 = cp.parse(sentence1)
    In [7]: result2 = cp.parse(sentence2)
  • 104. Entity recognition
    * NP-chunker
    In [8]: print result1
    (S
      (NP the/DT little/JJ yellow/JJ dog/NN)
      barked/VBD
      at/IN
      (NP the/DT cat/NN))
    In [9]: print result2
    (S
      (NP Rapunzel/NNP)
      let/VBD
      down/RP
      (NP her/PP$ long/JJ golden/JJ hair/NN))
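The two NP rules can be imitated with an ordinary regular expression over the sequence of tags. The sketch below is a stdlib-only illustration, not how `nltk.RegexpParser` is implemented: the function `np_chunk` is hypothetical, and merging both rules into a single alternation is a simplification that gives the same chunks for these two sentences but may differ from rule-by-rule application in other cases.

```python
import re

# <DT|PP$>? <JJ>* <NN>   or   <NNP>+   written over a string such as "<DT><JJ><NN>"
NP_PATTERN = re.compile(r'(?:<DT>|<PP\$>)?(?:<JJ>)*<NN>|(?:<NNP>)+')

def np_chunk(tagged_sent):
    """Return the NP chunks of a tagged sentence as lists of (word, tag)."""
    tag_string = ''.join('<%s>' % tag for _, tag in tagged_sent)
    # Map each character offset where a tag starts to its token index.
    index_of = {}
    offset = 0
    for i, (_, tag) in enumerate(tagged_sent):
        index_of[offset] = i
        offset += len(tag) + 2  # the tag plus its angle brackets
    index_of[offset] = len(tagged_sent)
    return [tagged_sent[index_of[m.start()]:index_of[m.end()]]
            for m in NP_PATTERN.finditer(tag_string)]

sentence1 = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
             ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
sentence2 = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
             ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]

for sentence in (sentence1, sentence2):
    print([' '.join(word for word, _ in chunk) for chunk in np_chunk(sentence)])
```

Since regex matches never overlap, the chunks returned for a sentence do not overlap either, which is the property required of chunks in the definition above.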
  • 105. Entity recognition
    * NP-chunker
    In [10]: result1.draw()
  • 106. Entity recognition
    * Chunking text corpora
    In [11]: from nltk.corpus import brown
             nps = []   # collects every NP subtree found
             for sent in brown.tagged_sents():
                 tree = cp.parse(sent)
                 for subtree in tree.subtrees():
                     if subtree.node == 'NP':
                         nps.append(subtree)
    In [12]: for np in nps[:10]:
                 print np
    (NP investigation/NN)
    (NP widespread/JJ interest/NN)
    (NP this/DT city/NN)
    (NP new/JJ multi-million-dollar/JJ airport/NN)
    (NP his/PP$ wife/NN)
    (NP His/PP$ political/JJ career/NN)
    ...
  • 107. Entity recognition
    * Named entities
      · Are definite noun phrases
      · Refer to specific types of individuals:
  • 108. Entity recognition
    * Named entity recognition
      · A task well suited to the classifier-based approach used for noun phrase chunking
  • 109. Entity recognition
    * Named entity recognition
      · Example:
    In [1]: sent = nltk.corpus.treebank.tagged_sents()[22]
    In [2]: print nltk.ne_chunk(sent)
    (S
      The/DT
      (GPE U.S./NNP)
      is/VBZ
      one/CD
      ...
      according/VBG
      to/TO
      (PERSON Brooke/NNP T./NNP Mossman/NNP)
      ...)
  • 110. Relation extraction
    * Extracts the relations that exist between the recognized named entities
    * Approach: initially look for all triples of the form (X, α, Y)
      · X and Y are named entities of specific types
      · α is the relation
  • 111. Relation extraction
    * Example:
    In [1]: import nltk
    In [2]: import re
    In [3]: IN = re.compile(r'.*\bin\b(?!\b.+ing)')
    In [4]: for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
                for rel in nltk.sem.extract_rels('ORG', 'LOC', doc,
                                                 corpus='ieer', pattern=IN):
                    print nltk.sem.relextract.show_raw_rtuple(rel)
  • 112. Relation extraction
    * Example:
    [ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
    [ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']
    [ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
    [ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
    [ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
    [ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
    [ORG: 'WGBH'] 'in' [LOC: 'Boston']
    [ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
    [ORG: 'Omnicom'] 'in' [LOC: 'New York']
    [ORG: 'DDB Needham'] 'in' [LOC: 'New York']
    [ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
    [ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
    [ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']
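The search for (X, α, Y) triples can be sketched on a hand-built sequence of entities and filler text. Everything in this block is an assumption made for illustration: the list `ITEMS`, the representation of entities as `(type, text)` tuples, and the local function `extract_rels`, which is a much simplified stand-in and not the `nltk.sem.extract_rels` API. The regular expression accepts filler text containing the word "in" unless a word ending in "ing" follows it, which filters out phrases such as "success in supervising".

```python
import re

IN = re.compile(r'.*\bin\b(?!\b.+ing)')

# Hypothetical document: named entities as (type, text) tuples, filler as strings.
ITEMS = [
    ("ORG", "Open Text"), ", based in", ("LOC", "Waterloo"),
    ", hired", ("PER", "Smith"), "in", ("LOC", "Boston"),
    "while", ("ORG", "Acme"),
    "reported success in supervising the move to", ("LOC", "Atlanta"),
]

def extract_rels(subj_type, obj_type, items, pattern):
    """Return (X, filler, Y) for adjacent entities of the requested types."""
    triples = []
    for i in range(len(items) - 2):
        x, filler, y = items[i], items[i + 1], items[i + 2]
        if not (isinstance(x, tuple) and isinstance(filler, str)
                and isinstance(y, tuple)):
            continue
        if x[0] == subj_type and y[0] == obj_type and pattern.match(filler):
            triples.append((x[1], filler, y[1]))
    return triples

print(extract_rels("ORG", "LOC", ITEMS, IN))
print(extract_rels("PER", "LOC", ITEMS, IN))
```

The entity types restrict which pairs are considered, and the pattern restricts which intervening text counts as evidence for the relation; both filters are needed to keep the false positives down.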
  • 113. Relation extraction
    * Exercise 3
      · From the ieer corpus, extract all the relations of the type “people were born in a location”
  • 115. Relation extraction
    * Exercise 3 (solution)
    import nltk
    import os
    import re
    BORN = re.compile(r'.*\bborn\b')
    # every file in the corpus directory except the README
    files = [f for f in os.listdir('nltk_data/corpora/ieer') if f != 'README']
    for f in files:
        for doc in nltk.corpus.ieer.parsed_docs(f):
            for rel in nltk.sem.extract_rels('PER', 'LOC', doc,
                                             corpus='ieer', pattern=BORN):
                print nltk.sem.relextract.show_raw_rtuple(rel)
  • 116. References
    Steven Bird, Ewan Klein, and Edward Loper. “Chapter 7: Extracting Information from Text.” Natural Language Processing with Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.
  • 117. Assignment
    * Assignment 9
      · Readings
        + Supervised classification (Natural Language Processing with Python)
        + Decision Tree Learning (Machine Learning)
  • 118. References
    Mitchell, Tom M. “Chapter 3: Decision Tree Learning.” Machine Learning. New York: McGraw-Hill, 1997. Print.
    Steven Bird, Ewan Klein, and Edward Loper. “Chapter 6: Learning to Classify Text - Supervised Classification.” Natural Language Processing with Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.
  • 119. Bibliography
    “Frequency Distribution.” Wikipedia, the free encyclopedia 7 Apr. 2014. Wikipedia. Web. 8 Apr. 2014.
    Mitchell, Tom M. Machine Learning. New York: McGraw-Hill, 1997. Print.
    “Part of Speech.” Wikipedia, the free encyclopedia 5 Apr. 2014. Wikipedia. Web. 8 Apr. 2014.
    Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.