NLP in 10 lines of code


At Cytora, our production system works 24/7 to transform billions of pieces of unstructured web data into structured data sets. This is a huge job, and we use spaCy to help us on a daily basis. spaCy is an easy-to-use, open-source Python NLP library that excels at large-scale information extraction. It supports tokenization, sentence segmentation, named entity recognition, part-of-speech tagging and dependency parsing.

During this talk, we demonstrate some of spaCy's core functionality by performing a simple NLP analysis of Jane Austen's Pride and Prejudice. The analysis will:
- Extract the character names from the book (e.g. Elizabeth, Darcy, Bingley)
- Visualise character occurrences relative to their position in the book (e.g. are some characters mentioned more at the beginning of the book and others more towards the end?)
- Describe Mr Darcy's character using syntactic dependencies


NLP in 10 lines of code
Andraž Hribernik

AGENDA
1. NLP analysis of Pride & Prejudice
○ Introduction to spaCy API
○ Extract characters and visualize them relative to their position in the book
○ Extract adjectives that describe a character in the book
2. How we use spaCy at Cytora

Pride & Prejudice by Jane Austen
What is the book about?
○ 5 unmarried Bennet daughters
○ 2 young, wealthy gentlemen (Mr Bingley & Mr Darcy) move into their neighbourhood
○ The oldest Bennet daughters (Jane & Elizabeth) become involved with said gentlemen

Recreate the plot in 10 lines of code!
1. Parse text
2. Extract named entities
3. Keep only personal named entities
4. Get offset for every extracted entity
5. Plot the graph

1. Parse text
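import spacy

# 'en' is the shortcut name from spaCy 1.x/2.x; spaCy 3+ needs a full
# package name such as 'en_core_web_sm'.
nlp = spacy.load('en')
text = open('pride_and_prejudice.txt').read()
processed_text = nlp(text)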

2. Extract named entities
import spacy
nlp = spacy.load('en')
text = open('pride_and_prejudice.txt').read()
processed_text = nlp(text)
for ent in processed_text.ents[:7]:
    print(ent.text, ent.label_)
Output:
The Project Gutenberg EBook of ORG
Jane Austen PERSON
the Project Gutenberg License ORG
www.gutenberg.org FAC
Pride ORG
Jane Austen PERSON
August 26, 2008 DATE

3. Keep only personal named entities
import spacy
nlp = spacy.load('en')
text = open('pride_and_prejudice.txt').read()
processed_text = nlp(text)
for ent in processed_text.ents[300:310]:
    if ent.label_ == 'PERSON':
        print(ent.text, ent.label_)
Output:
Bingley PERSON
Elizabeth PERSON
Darcy PERSON
William Lucas PERSON
Darcy PERSON
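
The filter in step 3 can be tried without loading a model. The sketch below is only an illustration: it assumes the (text, label) pairs printed in steps 2 and 3 have been copied into a plain list, and the names `sample_ents` and `person_counts` are hypothetical, not part of spaCy.

```python
from collections import Counter

# (text, label) pairs copied from the outputs shown in steps 2 and 3.
sample_ents = [
    ("The Project Gutenberg EBook of", "ORG"),
    ("Jane Austen", "PERSON"),
    ("the Project Gutenberg License", "ORG"),
    ("www.gutenberg.org", "FAC"),
    ("Pride", "ORG"),
    ("Jane Austen", "PERSON"),
    ("August 26, 2008", "DATE"),
    ("Bingley", "PERSON"),
    ("Elizabeth", "PERSON"),
    ("Darcy", "PERSON"),
    ("William Lucas", "PERSON"),
    ("Darcy", "PERSON"),
]

# Keep only PERSON entities and count how often each name occurs.
person_counts = Counter(
    text for text, label in sample_ents if label == "PERSON"
)

for name, count in person_counts.most_common():
    print(name, count)
```

Note that the author's name survives the filter as well, so a label check alone does not separate characters from other people mentioned in the file.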

4. Get offset for every extracted entity
...
processed_text = nlp(text)
character_offsets = defaultdict(list)
for ent in processed_text.ents:
    if ent.label_ == 'PERSON':
        character_offsets[ent.text].append(ent.start)
print(character_offsets['Elizabeth'][:5])
print(character_offsets['Darcy'][:5])
print(processed_text[1422])
print(processed_text[3229])
Output:
[1422, 3670, 3759, 3867, 4532]
[3005, 3229, 3367, 3410, 3754]
Elizabeth
Darcy

5. Plot the graph
from collections import defaultdict
import spacy
nlp = spacy.load('en')
text = open('pride_and_prejudice.txt').read()
processed_text = nlp(text)
character_offsets = defaultdict(list)
for ent in processed_text.ents:
    if ent.label_ == 'PERSON':
        character_offsets[ent.lemma_].append(ent.start)
# plot_character_timeseries is a plotting helper that is not defined on the slides.
plot_character_timeseries(character_offsets, ['darcy', 'bingley'])
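
The slides do not show how `plot_character_timeseries` works. As a rough stand-in, the sketch below bins token offsets into equal-width segments and prints a text histogram. Everything in it is an assumption made for illustration: `bin_offsets`, `N_BINS` and `TOTAL_TOKENS` are hypothetical names, the token count of 5000 is a placeholder rather than the real length of the book, and only the first five offsets per character from step 4 are used.

```python
def bin_offsets(offsets, total_tokens, n_bins):
    """Count how many offsets fall into each of n_bins equal-width segments."""
    counts = [0] * n_bins
    for offset in offsets:
        # Map the token offset to a segment index; clamp to the last bin.
        index = min(offset * n_bins // total_tokens, n_bins - 1)
        counts[index] += 1
    return counts


# First five offsets per character, as printed in step 4.
character_offsets = {
    "Elizabeth": [1422, 3670, 3759, 3867, 4532],
    "Darcy": [3005, 3229, 3367, 3410, 3754],
}
TOTAL_TOKENS = 5000  # assumption: placeholder, not the real token count
N_BINS = 5           # assumption: arbitrary number of segments

histograms = {
    name: bin_offsets(offsets, TOTAL_TOKENS, N_BINS)
    for name, offsets in character_offsets.items()
}

for name, counts in histograms.items():
    # One column per segment: '#' per mention, '.' for an empty segment.
    print(f"{name:<10}", " ".join("#" * c or "." for c in counts))
```

With the full `character_offsets` dictionary and `len(processed_text)` as the token count, the same binning would give the per-segment counts that a plotting library could draw as a time series.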

Describe Mr Darcy
● Automatically describe Mr Darcy (e.g. silent, tall, young, etc.)
● We can solve this problem using the syntactic dependencies that are part of the spaCy API
● Syntactic dependencies can be visualized with displaCy

Extract all ‘amod’ dependencies in the entity's subtree
darcy_adjectives = []
darcy_ents = [ent for ent in processed_text.ents if ent.lemma_ == 'darcy']
for ent in darcy_ents:
    for token in ent.subtree:
        if token.dep_ == 'amod':
            darcy_adjectives.append(token.lemma_)
print(set(darcy_adjectives))
Output:
{'handsome', 'last', 'grave', 'silent', 'particular', 'young', 'poor', 'abominable', 'disappointing', 'disagreeable', 'confidential', 'late', 'little', 'charming', 'present', 'intimate'}

Describe Mr Darcy
(Figure: dependency parse with the noun subject and adjective complement relations labelled)

Extract all ‘acomp’ from entity’s root subtree
for ent in darcy_ents:
    if ent.root.dep_ == 'nsubj':
        for child in ent.root.head.children:
            if child.dep_ == 'acomp':
                darcy_adjectives.append(child.lemma_)
Output:
{'kind', 'ashamed', 'impatient', 'answerable', 'sorry', 'unworthy', 'grow', 'fond', 'proud', 'engaged', 'little', 'clever', 'worth', 'tall', 'studious', 'punctual'}
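
Both extraction rules can be exercised without a parser. The sketch below hand-builds a tiny dependency tree for the made-up sentence "Young Darcy was tall" and applies the same `amod` and `acomp` checks. The `Tok` class is a hypothetical stand-in, not spaCy's `Token`, and the dependency labels are assigned by hand for illustration rather than produced by a parser.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(eq=False)
class Tok:
    """Hypothetical stand-in for a parsed token (not spaCy's Token)."""
    text: str
    dep: str
    head: Optional["Tok"] = None
    children: List["Tok"] = field(default_factory=list)

    def attach(self, child: "Tok") -> "Tok":
        """Make `child` a syntactic child of this token and return it."""
        child.head = self
        self.children.append(child)
        return child

    def subtree(self) -> Iterator["Tok"]:
        """Yield this token and all of its descendants (depth-first)."""
        yield self
        for child in self.children:
            yield from child.subtree()


# Hand-labelled parse of "Young Darcy was tall".
verb = Tok("was", "ROOT")
darcy = verb.attach(Tok("Darcy", "nsubj"))
darcy.attach(Tok("young", "amod"))
verb.attach(Tok("tall", "acomp"))

adjectives = []
# Rule 1: adjectival modifiers inside the entity's subtree.
adjectives += [t.text for t in darcy.subtree() if t.dep == "amod"]
# Rule 2: adjectival complements of the verb whose subject is the entity.
if darcy.dep == "nsubj":
    adjectives += [c.text for c in darcy.head.children if c.dep == "acomp"]

print(adjectives)
```

One simplification to keep in mind: this toy `subtree` walks depth-first, whereas spaCy yields subtree tokens in document order. For collecting a set of adjectives the order does not matter.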

Pros & Cons of syntactic dependencies approach
Pros:
● No training dataset is needed
● Intuitive
● In our experience, you can achieve decent extraction precision
Cons:
● Our approach achieved very poor recall
● spaCy's dependency parsing only works within a single sentence
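
To make the precision and recall trade-off concrete, the sketch below computes both as set overlaps. The two sets are invented toy data, not results from the talk: `extracted` stands for adjectives a rule-based extractor returned, and `gold` for a hypothetical hand-made reference list.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of `extracted` against the `gold` set."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented toy data for illustration only.
extracted = {"tall", "proud", "little"}
gold = {"tall", "proud", "handsome", "rich", "reserved", "clever"}

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case most of what was extracted is correct, but most of the reference list is missed, which is the pattern the slide describes: acceptable precision, poor recall.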

spaCy at Cytora
● We process 2M documents every day with spaCy
● Named entity recognition (geolocations, actors)
● Dependency parsing (impact metric extraction)
● Integrated Word Embeddings (preprocessing for DL models)

Cytora is hiring!
● Data Engineer
● Data Science Analyst
● Risk Modeler
All open positions

Thank you!
https://github.com/cytora/pycon-nlp-in-10-lines
https://spacy.io/
https://demos.explosion.ai/displacy/
http://www.cytora.com/
andraz@cytora.com
