How to prepare data for NLP
Loryfel Nunez
@lorynyc
California Gold Rush
“Extracting actionable information from modern big data sets requires the equivalent processing infrastructure of extracting a nugget of GOLD from a mountain of DIRT.”
Nikolas Markou (via LinkedIn)
1. Have an intuition on how things work
2. Breaking data down
3. Keep it simple .. if possible
1. How does it work, anyway?
The General NLP Problem
dog: 3, 2, 1
red coat: 0, 0, 1
😋
😭
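The counts on this slide can be read as a tiny term-by-document table: each row is a term, each column is a document. A minimal sketch of how such counts might be produced, using only the standard library. The three documents are hypothetical, written here so that the output matches the slide; `count_terms` and `ngrams` are illustrative helpers, not part of any library.

```python
import re
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_terms(doc, max_n=2):
    """Count unigrams and bigrams in one document (lowercased, punctuation dropped)."""
    tokens = re.findall(r"[a-z0-9]+", doc.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts

# Hypothetical documents, invented to reproduce the numbers on the slide.
docs = [
    "The dog chased a dog and another dog.",
    "A dog met a dog.",
    "The dog wore a red coat.",
]

# One row per term, one column per document.
matrix = {term: [count_terms(d)[term] for d in docs] for term in ("dog", "red coat")}

for term, row in matrix.items():
    print(f"{term}: {row}")
```

The algorithm downstream only ever sees rows of numbers like these, which is why the rest of the deck is about controlling what goes into them.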
Controlling the input
Document Unit
Representation of text
Inside the Machine
Smith acquires shares of Novak and Kline for $10.99 per share .
2. BREAK IT DOWN
Let’s Break it Down
á
Novák
Novák and Kline
Smith acquires shares of Novak and Kline for $10.99 per share.
Smith acquires shares of Novak and Kline for $10.99 per share.
Smith Inc. acquires shares of Novak and Kline for $10.99 per share.
Smith acquires common shares of N & K for $10.99/share.
In the real world
<p><b>Smith Buys Novak</b></p>
<p></p>
<p>by Anna Smith<p>
<p> LONDON --- Smith Inc. acquires shares for Novak &amp; Kline. for
$10.99/share.</p>
<table style="width:100%">
<tr><th>Col1</th><th>Col2</th> </tr>
<tr><td>data1</td><td>data2</td></tr>
</table>
2. … if possible
Character
á
&amp;
Do you know the encoding of your input data?
◉ User tells you
◉ Metadata
◉ Figure it out (using chardet, or similar)
◉ Have your own heuristics
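The options above could be combined into a small fallback chain: trust a declared encoding if there is one, then try a short list of likely candidates. The sketch below uses only the standard library and is one possible heuristic, not the approach used in the talk; the candidate order and the helper names `decode_bytes` and `clean_characters` are assumptions. A detector such as chardet could be slotted in before the fallback list.

```python
import html
import unicodedata

# Hypothetical candidate order; tune it to the sources you actually ingest.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "latin-1")

def decode_bytes(raw, declared=None):
    """Decode raw bytes, trying a declared encoding first, then the fallbacks.

    Returns (text, encoding_used).
    """
    candidates = ([declared] if declared else []) + list(CANDIDATE_ENCODINGS)
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next one
    # Only reached if the candidate list was changed to exclude latin-1.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"

def clean_characters(text):
    """Unescape HTML entities and normalize to NFC so 'á' has one representation."""
    return unicodedata.normalize("NFC", html.unescape(text))

text, used = decode_bytes("Novák".encode("cp1252"))
print(text, used)
print(clean_characters("Novak &amp; Kline"))
```

Trying UTF-8 first is a deliberate choice: bytes from a single-byte encoding such as cp1252 rarely form valid UTF-8, so a successful UTF-8 decode is a fairly strong signal.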
Tokens
Forty-two, 42
Post-colonial, postcolonial
eBay, Ebay, EBAY, ebay
Fed, FED, fed
C.A.T., CAT
Heuristics
Mappings
Transformations
numToWord, POS (from spaCy or NLTK)
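A minimal sketch of the "heuristics plus mappings" idea for the token variants listed above. The mapping table, the acronym pattern, and the function name `normalize_token` are all hypothetical; a real table would come from your own data. Note that "Fed" and "fed" are deliberately left alone here, since case can carry meaning and a part-of-speech tag is usually needed to tell them apart.

```python
import re

# Hypothetical mapping from lowercased variants to one canonical form.
CANONICAL = {
    "ebay": "eBay",
    "forty-two": "42",
    "post-colonial": "postcolonial",
}

# Heuristic: two or more single letters each followed by a period, e.g. "C.A.T."
DOTTED_ACRONYM = re.compile(r"^(?:[A-Za-z]\.){2,}$")

def normalize_token(token):
    """Map a token to its canonical form, or return it unchanged."""
    if DOTTED_ACRONYM.match(token):
        return token.replace(".", "").upper()
    return CANONICAL.get(token.lower(), token)

for raw in ["Forty-two", "Post-colonial", "EBAY", "ebay", "C.A.T.", "Fed", "fed"]:
    print(raw, "->", normalize_token(raw))
```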
Tokens
STEMMING vs LEMMATIZATION
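import spacy
from nltk.stem.porter import PorterStemmer

# 'en' is the spaCy 2.x shortcut; on spaCy 3+ load 'en_core_web_sm' instead
nlp = spacy.load('en')
stemmer = PorterStemmer()

doc = nlp(u'She is an intelligence operative.')
for word in doc:
    # The stemmer strips suffixes by rule; the lemma is the dictionary form
    stemmed = stemmer.stem(word.text)
    print(word.text, " LEMMA => ", word.lemma_, " STEM => ", stemmed)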
She LEMMA => -PRON- STEM => she
is LEMMA => be STEM => is
an LEMMA => an STEM => an
intelligence LEMMA => intelligence STEM => intellig
operative LEMMA => operative STEM => oper
. LEMMA => . STEM => .
SpaCy, NLTK
Entities
Novak and Kline, NK, NYSE:NK, Test Company
June 30, 2017
06/30/2017
30/6/2017
Smith acquires shares of Novak and Kline for $10.99 per share .
Smith acquires shares of NK for $10.99 per share .
ORG acquires shares of ORG for $10.99 per share .
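The two ideas on this slide, mapping many surface forms to one entity and many date formats to one date, can be sketched with the standard library alone. The alias table and format list below are hypothetical and hand-written for this example; in practice the entity spans would more likely come from a named-entity recognizer such as spaCy's.

```python
import re
from datetime import datetime

# Hypothetical alias table: every known surface form maps to an entity type.
ALIASES = {
    "Novak and Kline": "ORG",
    "NYSE:NK": "ORG",
    "NK": "ORG",
    "Smith": "ORG",
}

def mask_entities(text, aliases=ALIASES):
    """Replace each known alias with its entity type, longest alias first."""
    names = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(name) for name in names) + r")(?!\w)"
    )
    return pattern.sub(lambda match: aliases[match.group(1)], text)

# Formats are tried in order, so an ambiguous date such as 06/07/2017
# is read month-first. That ordering is an assumption of this sketch.
DATE_FORMATS = ("%B %d, %Y", "%m/%d/%Y", "%d/%m/%Y")

def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

print(mask_entities("Smith acquires shares of NK for $10.99 per share ."))
for raw in ["June 30, 2017", "06/30/2017", "30/6/2017"]:
    print(raw, "->", normalize_date(raw))
```

Sorting aliases longest-first keeps "NYSE:NK" from being partially matched as "NK".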
Hot or Not
                  REMOVING                           HIGHLIGHTING
WORDS             Emails, dates, URLs, stop words    hotwords
More than WORDS   tables                             Hot patterns
textacy
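A rough sketch of the remove/highlight split for the word-level row, using plain regular expressions rather than textacy (whose preprocessing API has changed between versions). The patterns, the stop-word list, the hotword list, and the function name `clean_and_flag` are all assumptions made for illustration; the email and URL patterns are intentionally loose.

```python
import re

# Loose, illustrative patterns; real-world emails and URLs are messier.
URL = re.compile(r"https?://\S+")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Hypothetical lists; both depend on the task at hand.
STOP_WORDS = {"a", "an", "and", "for", "of", "the", "to"}
HOTWORDS = {"acquires", "acquired", "acquisition"}

def clean_and_flag(text):
    """Remove URLs, emails and stop words; return (cleaned text, hotwords found)."""
    text = URL.sub(" ", text)
    text = EMAIL.sub(" ", text)
    kept = [tok for tok in text.split() if tok.lower() not in STOP_WORDS]
    hot = sorted({tok.lower() for tok in kept if tok.lower() in HOTWORDS})
    return " ".join(kept), hot

cleaned, hot = clean_and_flag(
    "Smith acquires shares of Novak and Kline see https://example.com or mail ir@example.com"
)
print(cleaned)
print(hot)
```

Removing the stop word "and" also breaks up "Novak and Kline", so entity handling generally needs to run before stop-word removal.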
In the real world
<p><b>Smith Buys Novak</b></p>
<p></p>
<p>by Anna Smith<p>
<p> LONDON --- Smith Inc. acquires shares for Novak &amp; Kline. for
$10.99/share.</p>
<table style="width:100%">
<tr><th>Col1</th><th>Col2</th> </tr>
<tr><td>data1</td><td>data2</td></tr>
</table>
IRL
{"title": "Smith Buys …",
 "original_text": "LONDON --- Smith..",
 "transformed_text": {
   "text_with_entities": "LOCATION – ORG acquired …. ",
   "lemmatized": "Smith Inc acquire share..",
   "has_acquired": true
 },
 "table": "<table>….. </table>"
}
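One way to get from the raw HTML shown earlier to a record shaped like this one is to separate running text from table content while parsing. The sketch below uses the standard library's `html.parser`; the class and function names are hypothetical, it treats the first text fragment as the title (an assumption that happens to hold for the sample), it leaves the byline inside `original_text`, and it stores table cells as a list instead of raw markup.

```python
import json
from html.parser import HTMLParser

class ArticleParser(HTMLParser):
    """Collect visible text outside <table> and table cell text separately."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into &
        self.table_depth = 0
        self.text_parts = []
        self.table_cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth:
            self.table_depth -= 1

    def handle_data(self, data):
        data = " ".join(data.split())  # collapse newlines and repeated spaces
        if data:
            target = self.table_cells if self.table_depth else self.text_parts
            target.append(data)

def to_record(html_doc):
    """Build a dict with title, text, a simple flag, and table cells."""
    parser = ArticleParser()
    parser.feed(html_doc)
    parser.close()
    parts = parser.text_parts
    title = parts[0] if parts else ""
    text = " ".join(parts[1:])
    return {
        "title": title,
        "original_text": text,
        "transformed_text": {"has_acquired": "acquire" in text.lower()},
        "table": parser.table_cells,
    }

SAMPLE = """<p><b>Smith Buys Novak</b></p>
<p></p>
<p>by Anna Smith<p>
<p> LONDON --- Smith Inc. acquires shares for Novak &amp; Kline. for
$10.99/share.</p>
<table style="width:100%">
<tr><th>Col1</th><th>Col2</th> </tr>
<tr><td>data1</td><td>data2</td></tr>
</table>"""

record = to_record(SAMPLE)
print(json.dumps(record, indent=2, ensure_ascii=False))
```

Keeping the original text next to each transformed version, as the slide does, makes it possible to re-run or audit any single step later.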
The General NLP Problem
dog: 3, 2, 1
red coat: 0, 0, 1
😋
😭
1. Have an intuition on how things work -- how algorithms see text
2. Breaking data down -- from bytes to documents
3. Keep it simple .. if possible -- patterns, normalization, metadata, actions (replace, remove, highlight)
References and other stuff
◉ Stanford NLP Group
◉ spaCy documentation
◉ scikit-learn documentation
◉ The hard knocks of NLP projects
Any questions?
You can find me at
◉ @lorynyc
◉ loryn808@gmail.com
Thanks!

PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx
