3. “Extracting actionable information from modern big data sets requires the equivalent processing infrastructure of extracting a nugget of GOLD from a mountain of DIRT.”
Nikolas Markou (via LinkedIn)
4. Have an intuition on how things work
Breaking data down
Keep it simple ... if possible
8. Inside the Machine
Smith acquires shares of Novak and Kline for $10.99 per share .
10. Let’s Break it Down
á
Novák
Novák and Kline
Smith acquires shares of Novak and Kline for $10.99 per share.
Smith Inc. acquires shares of Novak and Kline for $10.99 per share.
Smith acquires common shares of N & K for $10.99/share.
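The levels on this slide (a single character, a word, a sentence) can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the helper names `fold_accents` and `break_down` are hypothetical, whitespace splitting stands in for a real tokenizer, and accent folding is just one possible way to make "Novák" and "Novak" compare equal.

```python
import unicodedata


def fold_accents(text):
    """Strip combining marks so that 'Novák' and 'Novak' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def break_down(sentence):
    """Return the same sentence viewed as bytes, characters and tokens."""
    return {
        "bytes": sentence.encode("utf-8"),  # what is stored on disk
        "chars": list(sentence),            # what a human reads
        "tokens": sentence.split(),         # naive whitespace tokens
    }


views = break_down("Smith acquires shares of Novák and Kline for $10.99 per share.")
print(len(views["bytes"]), len(views["chars"]), views["tokens"][4])
print(fold_accents(views["tokens"][4]))
```

The byte count is one larger than the character count because "á" takes two bytes in UTF-8, which is the kind of detail that matters once the encoding is unknown.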
11. In the real world
<p><b>Smith Buys Novak</b></p>
<p></p>
<p>by Anna Smith<p>
<p> LONDON --- Smith Inc. acquires shares for Novak & Kline. for
$10.99/share.</p>
<table style="width:100%">
<tr><th>Col1</th><th>Col2</th> </tr>
<tr><td>data1</td><td>data2</td></tr>
</table>
13. Character
á
&
Do you know the encoding of your input data?
◉ User tells you
◉ Metadata
◉ Figure it out (using chardet, or similar)
◉ Have your own heuristics
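The options above can be chained into a fallback sequence. The sketch below is a minimal, dependency-free illustration: the function name `decode_bytes`, the `declared` parameter (standing in for what the user or the metadata says) and the candidate list are all assumptions, and the detector step (chardet or similar) is left out to keep the example self-contained.

```python
def decode_bytes(raw, declared=None, candidates=("utf-8", "cp1252")):
    """Decode raw bytes, returning (text, encoding_used).

    Order of trust: declared encoding (user or metadata), then a
    home-grown heuristic that tries strict decoding with each candidate.
    """
    if declared:
        try:
            return raw.decode(declared), declared
        except (UnicodeDecodeError, LookupError):
            pass  # declared encoding was wrong or unknown; fall through

    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue

    # latin-1 maps every byte to a character, so it never fails;
    # the result may still be wrong, it is only a last resort.
    return raw.decode("latin-1"), "latin-1"


print(decode_bytes("Novák".encode("cp1252")))
print(decode_bytes("Novák".encode("utf-8"), declared="ascii"))
```

Trying UTF-8 first is a common heuristic because arbitrary non-UTF-8 bytes rarely form valid UTF-8 sequences, so a clean strict decode is reasonably strong evidence.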
15. Tokens
STEMMING vs LEMMATIZATION
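import spacy
from nltk.stem.porter import PorterStemmer

# 'en' is the shortcut name used by older spaCy releases;
# spaCy 3 expects a full model name such as 'en_core_web_sm'.
nlp = spacy.load('en')
stemmer = PorterStemmer()
doc = nlp(u'She is an intelligence operative.')
for word in doc:
    # The stem comes from NLTK's Porter stemmer, the lemma from spaCy.
    stemmed = stemmer.stem(word.text)
    print(word.text, " LEMMA => ", word.lemma_, " STEM => ", stemmed)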
She LEMMA => -PRON- STEM => she
is LEMMA => be STEM => is
an LEMMA => an STEM => an
intelligence LEMMA => intelligence STEM => intellig
operative LEMMA => operative STEM => oper
. LEMMA => . STEM => .
spaCy, NLTK
16. Entities
Novak and Kline, NK, NYSE:NK, Test Company
June 30, 2017
06/30/2017
30/6/2017
Smith acquires shares of Novak and Kline for $10.99 per share .
Smith acquires shares of NK for $10.99 per share .
ORG acquires shares of ORG for $10.99 per share .
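The entity slide shows two normalizations: several names collapsing to one placeholder, and several date spellings that mean the same day. A rough standard-library sketch follows. The alias list is taken from the slide, while the function names `replace_orgs` and `to_iso_date`, the fixed alias list (instead of a trained entity recognizer) and the order in which date formats are tried are assumptions. Month names are parsed with the default English locale.

```python
import re
from datetime import datetime

# Aliases taken from the slide; a real system would get these from an
# entity recognizer or a curated dictionary.
ORG_ALIASES = ["Novak and Kline", "NYSE:NK", "NK", "Test Company", "Smith"]

# Longest alias first, so "NYSE:NK" is preferred over the bare "NK".
_ORG_RE = re.compile(
    r"\b(?:"
    + "|".join(re.escape(a) for a in sorted(ORG_ALIASES, key=len, reverse=True))
    + r")\b"
)

# Tried in order; month-first wins for ambiguous dates such as 06/07/2017.
DATE_FORMATS = ("%B %d, %Y", "%m/%d/%Y", "%d/%m/%Y")


def replace_orgs(text, placeholder="ORG"):
    """Replace every known organization alias with a placeholder."""
    return _ORG_RE.sub(placeholder, text)


def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(replace_orgs("Smith acquires shares of NK for $10.99 per share ."))
for spelling in ("June 30, 2017", "06/30/2017", "30/6/2017"):
    print(spelling, "=>", to_iso_date(spelling))
```

Note that a plain alias list is blunt: "Smith" here would also rewrite a person's surname, which is the kind of case where a proper entity recognizer earns its keep.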
17. Hot or Not
                  REMOVING                           HIGHLIGHTING
WORDS             Emails, dates, URLs, stop words    hotwords
More than WORDS   tables                             Hot patterns
textacy
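As a rough illustration of the removing and highlighting columns at the word level, the sketch below uses plain regular expressions as a stand-in for textacy. Everything in it is an assumption made for the example: the function name `clean_and_highlight`, the deliberately simple URL and email patterns, the tiny stop-word and hotword sets, and the choice to highlight by upper-casing.

```python
import re

URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

# Tiny illustrative sets; real lists would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "at"}
HOTWORDS = {"acquires", "shares"}


def clean_and_highlight(text):
    """Remove URLs, emails and stop words; upper-case hotwords."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    kept = []
    for token in text.split():
        key = token.lower().strip(".,;:!?")
        if key in STOP_WORDS:
            continue  # removing
        kept.append(token.upper() if key in HOTWORDS else token)  # highlighting
    return " ".join(kept)


print(clean_and_highlight(
    "Smith acquires shares of Novak and Kline for $10.99 per share, "
    "says anna@example.com at http://example.com/news"
))
```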
18. In the real world
<p><b>Smith Buys Novak</b></p>
<p></p>
<p>by Anna Smith<p>
<p> LONDON --- Smith Inc. acquires shares for Novak & Kline. for
$10.99/share.</p>
<table style="width:100%">
<tr><th>Col1</th><th>Col2</th> </tr>
<tr><td>data1</td><td>data2</td></tr>
</table>
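Getting from markup like this back to plain sentences can be sketched with the standard library's `html.parser`. This is a minimal illustration, not a robust HTML cleaner: the class name `TextExtractor` and the helper `html_to_text` are hypothetical, and dropping everything inside `<table>` is an assumption that follows the "removing tables" idea from the Hot or Not slide.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping anything inside a <table>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._table_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table_depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self._table_depth:
            self._table_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text outside tables, with whitespace collapsed.
        if self._table_depth == 0 and data.strip():
            self.parts.append(" ".join(data.split()))


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample = (
    "<p><b>Smith Buys Novak</b></p>\n<p></p>\n<p>by Anna Smith<p>\n"
    "<p> LONDON --- Smith Inc. acquires shares for Novak & Kline. for\n"
    "$10.99/share.</p>\n"
    '<table style="width:100%">\n<tr><th>Col1</th><th>Col2</th> </tr>\n'
    "<tr><td>data1</td><td>data2</td></tr>\n</table>"
)
print(html_to_text(sample))
```

The parser tolerates the unclosed `<p>` and the bare `&` in the sample, which is part of the point: real-world input is rarely well-formed.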
21. Have an intuition on how things work
Breaking data down
Keep it simple ... if possible
-- how algorithms see text
-- from bytes to documents
-- patterns, normalization, metadata, actions (replace, remove, highlight)
22. References and other stuff
◉ Stanford NLP Group
◉ spaCy Documentation
◉ scikit-learn Documentation
◉ The hard knocks of NLP projects