At Cytora, our production system works 24/7 to transform billions of pieces of unstructured web data into structured data sets. This is a huge job, and we use spaCy to help us on a daily basis.
spaCy is an easy-to-use open-source Python NLP library that excels at large-scale information extraction. It supports tokenization, sentence segmentation, named entity recognition, part-of-speech tagging and dependency parsing.
During this talk, we are going to demonstrate some of spaCy's core functionalities by performing a simple NLP analysis on Jane Austen's Pride and Prejudice.
Here's what we will achieve during this analysis:
- Extract the character names from the book (e.g. Elizabeth, Darcy, Bingley)
- Visualise character occurrences by their relative position in the book (e.g. are some characters mentioned more at the beginning of the book and others more towards the end?)
- Describe Mr Darcy's character using syntactic dependencies
---
2. AGENDA
1. NLP analysis of Pride & Prejudice
○ Introduction to spaCy API
○ Extract characters and visualize them relative to their position in the book
○ Extract adjectives that describe a character in the book
2. How we use spaCy at Cytora
3. Pride & Prejudice by Jane Austen
What is the book about?
○ 5 unmarried Bennet daughters
○ 2 young, wealthy gentlemen (Mr Bingley & Mr Darcy) move into their neighbourhood
○ The oldest Bennet daughters (Jane & Elizabeth) become involved with said gentlemen
5. Recreate the plot in 10 lines of code!
1. Parse text
2. Extract named entities
3. Keep only personal named entities
4. Get offset for every extracted entity
5. Plot the graph
6. 1. Parse text
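import spacy

# 'en' is the shortcut link used by older spaCy releases; spaCy 3+ needs a
# full model name such as 'en_core_web_sm'.
nlp = spacy.load('en')
# Read the whole novel and run it through the spaCy pipeline.
text = open('pride_and_prejudice.txt').read()
processed_text = nlp(text)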
7. 2. Extract named entities
import spacy
nlp = spacy.load('en')
text = open('pride_and_prejudice.txt').read()
processed_text = nlp(text)
for ent in processed_text.ents[:7]:
    print(ent.text, ent.label_)
Output:
The Project Gutenberg EBook of ORG
Jane Austen PERSON
the Project Gutenberg License ORG
www.gutenberg.org FAC
Pride ORG
Jane Austen PERSON
August 26, 2008 DATE
8. 3. Keep only personal named entities
import spacy
nlp = spacy.load('en')
text = open('pride_and_prejudice.txt').read()
processed_text = nlp(text)
for ent in processed_text.ents[300:310]:
    if ent.label_ == 'PERSON':
        print(ent.text, ent.label_)
Output:
Bingley PERSON
Elizabeth PERSON
Darcy PERSON
William Lucas PERSON
Darcy PERSON
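Once only PERSON entities remain, a natural next step is to count how often each name occurs. The following is a minimal sketch, assuming the entities have already been reduced to (text, label) pairs. The sample list simply copies the output shown above, and `count_person_mentions` is a hypothetical helper name, not part of spaCy.

```python
from collections import Counter


def count_person_mentions(entities):
    """Count mentions per name, keeping only PERSON entities.

    `entities` is an iterable of (text, label) pairs, e.g. built with
    [(ent.text, ent.label_) for ent in processed_text.ents].
    """
    return Counter(text for text, label in entities if label == 'PERSON')


# Sample pairs copied from the output above (not the full book).
sample = [
    ('Bingley', 'PERSON'),
    ('Elizabeth', 'PERSON'),
    ('Darcy', 'PERSON'),
    ('William Lucas', 'PERSON'),
    ('Darcy', 'PERSON'),
]
mention_counts = count_person_mentions(sample)
print(mention_counts.most_common(1))  # [('Darcy', 2)]
```

On the full book, the same counter could be used to pick the most frequently mentioned characters before plotting them.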
9. 4. Get offset for every extracted entity
...
processed_text = nlp(text)
character_offsets = defaultdict(list)
for ent in processed_text.ents:
    if ent.label_ == 'PERSON':
        character_offsets[ent.text].append(ent.start)
print(character_offsets['Elizabeth'][:5])
print(character_offsets['Darcy'][:5])
print(processed_text[1422])
print(processed_text[3229])
Output:
[1422, 3670, 3759, 3867, 4532]
[3005, 3229, 3367, 3410, 3754]
Elizabeth
Darcy
10. 5. Plot the graph
from collections import defaultdict
import spacy
nlp = spacy.load('en')
text = open('pride_and_prejudice.txt').read()
processed_text = nlp(text)
character_offsets = defaultdict(list)
for ent in processed_text.ents:
    if ent.label_ == 'PERSON':
        character_offsets[ent.lemma_].append(ent.start)
# plot_character_timeseries is a plotting helper that is not defined on these slides.
plot_character_timeseries(character_offsets, ['darcy', 'bingley'])
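The body of `plot_character_timeseries` is not shown on the slides. One way the token offsets could be prepared for such a plot is to count mentions per equal-sized segment of the book. The sketch below is an assumption about that preparation step, not the helper used in the talk: `bin_offsets`, `total_tokens` and `num_bins` are hypothetical names, the offsets are the first five per character from the output on slide 9, and `total_tokens=5000` is an arbitrary value that just covers those samples rather than the real length of the book.

```python
def bin_offsets(offsets, total_tokens, num_bins):
    """Count how many offsets fall into each of `num_bins` equal segments."""
    counts = [0] * num_bins
    for offset in offsets:
        # Map a token offset in [0, total_tokens) to a segment index.
        index = min(offset * num_bins // total_tokens, num_bins - 1)
        counts[index] += 1
    return counts


# First five offsets per character, copied from the slide 9 output.
character_offsets = {
    'Elizabeth': [1422, 3670, 3759, 3867, 4532],
    'Darcy': [3005, 3229, 3367, 3410, 3754],
}
# total_tokens=5000 is an arbitrary illustrative value, not the book length.
for name, offsets in character_offsets.items():
    print(name, bin_offsets(offsets, total_tokens=5000, num_bins=5))
```

The per-segment counts can then be passed to any plotting library as one series per character.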
13. Describe Mr Darcy
● Automatically describe Mr Darcy (e.g. silent, tall, young)
● We can solve this problem using the syntactic dependencies that are part of the spaCy API
● Syntactic dependencies can be visualized nicely with displaCy
15. Extract all ‘amod’ dependencies in the entity’s subtree
darcy_adjectives = []
darcy_ents = [ent for ent in processed_text.ents if ent.lemma_ == 'darcy']
for ent in darcy_ents:
    for token in ent.subtree:
        if token.dep_ == 'amod':
            darcy_adjectives.append(token.lemma_)
print(set(darcy_adjectives))
Output:
{'handsome', 'last', 'grave', 'silent',
'particular', 'young', 'poor',
'abominable', 'disappointing',
'disagreeable', 'confidential', 'late',
'little', 'charming', 'present',
'intimate'}
17. Extract all ‘acomp’ dependents of the entity root’s head
for ent in darcy_ents:
    if ent.root.dep_ == 'nsubj':
        for child in ent.root.head.children:
            if child.dep_ == 'acomp':
                darcy_adjectives.append(child.lemma_)
Output:
{'kind', 'ashamed', 'impatient',
'answerable', 'sorry', 'unworthy',
'grow', 'fond', 'proud', 'engaged',
'little', 'clever', 'worth', 'tall',
'studious', 'punctual'}
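To make the nsubj-to-acomp traversal concrete without loading a model, here is a small sketch on a hand-built tree for the sentence "Darcy was proud". It assumes the parser labels "Darcy" as the `nsubj` and "proud" as the `acomp` of "was". `ToyToken` is a hypothetical stand-in that mirrors only the `lemma_`, `dep_`, `head` and `children` attributes of a spaCy token, and `adjectival_complements` is an illustrative name.

```python
class ToyToken:
    """Tiny stand-in for a spaCy Token (hypothetical, for illustration only)."""

    def __init__(self, lemma, dep, head=None):
        self.lemma_ = lemma
        self.dep_ = dep
        self.children = []
        # Like spaCy, the root token is its own head.
        self.head = head if head is not None else self
        if head is not None:
            head.children.append(self)


def adjectival_complements(subject):
    """Return lemmas of 'acomp' siblings when `subject` is an 'nsubj'."""
    if subject.dep_ != 'nsubj':
        return []
    return [child.lemma_ for child in subject.head.children
            if child.dep_ == 'acomp']


# Assumed parse of "Darcy was proud": Darcy is nsubj and proud is acomp of "was".
was = ToyToken('be', 'ROOT')
darcy = ToyToken('darcy', 'nsubj', head=was)
proud = ToyToken('proud', 'acomp', head=was)

print(adjectival_complements(darcy))  # ['proud']
```

The same walk (subject, then its head verb, then that verb's `acomp` children) is what the loop above performs on real parses.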
18. Pros & Cons of the syntactic dependencies approach
Pros
● Training dataset is not needed
● Intuitive
● In our experience, you can achieve decent extraction precision
Cons
● Our approach achieved very poor recall
● spaCy dependency parsing only works within a single sentence
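Precision and recall, as used in the trade-offs above, can be computed from a set of extracted adjectives and a hand-labelled reference set. The sketch below only illustrates the arithmetic: both sets are hypothetical and invented for this example, they are not results from the talk, and `precision_recall` is an illustrative helper name.

```python
def precision_recall(extracted, relevant):
    """Return (precision, recall) for two sets of adjectives."""
    true_positives = len(extracted & relevant)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data, invented for illustration only; not results from the talk.
extracted = {'proud', 'tall', 'clever', 'grow'}
relevant = {'proud', 'tall', 'clever', 'handsome',
            'rich', 'reserved', 'shy', 'silent'}

precision, recall = precision_recall(extracted, relevant)
print(f'precision={precision:.2f} recall={recall:.2f}')
```

In this made-up case three of the four extracted words are correct (precision 0.75), but only three of the eight reference words are found (recall 0.375), which is the shape of trade-off the slide describes.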
20. spaCy at Cytora
● We process 2M documents every day with spaCy
● Named entity recognition (geolocations, actors)
● Dependency parsing (impact metric extraction)
● Integrated Word Embeddings (preprocessing for DL models)
21. Cytora is hiring!
● Data Engineer
● Data Science Analyst
● Risk Modeler
All open positions