SlideShare ist ein Scribd-Unternehmen logo
1 von 60
Downloaden Sie, um offline zu lesen
Distributed
Natural Language Processing
Systems in Python
Clare Corthell,Founder of Luminant Data
@clarecorthell
thinkingmachin.es/events
Luminant Data
is a Machine Intelligence Consultancy
building more intelligent businesses
with data science strategy and
artificial intelligence technology
luminantdata.com
based in San Francisco
Clare Corthell, Founder
datasciencemasters.org
The Open Source Data Science Masters
The most popular curriculum for
learning data science
Author: Clare Corthell
development environment
python 2.7
pip, pandas, iPython, TextBlob, scikit-learn
@clarecorthell
hold on to your seats,
we’re doing NLP end-to-end
@clarecorthell
What is natural language processing 4?
@clarecorthell
how does this artificial assistant
know what you’re talking about?@clarecorthell
It’s simple, right?
@clarecorthell
wired.com
Natural Language Processing is the domain
concerned with translating human language
into something a computer can process
@clarecorthell
teaching computers to better
understand human language
today —
@clarecorthell
Natural Language Processing
Code
1.Intuition forTextAnalysis
2.Representations ofText for Computation
Case Study
3.Mining theWeb for ObscureWords
4.NLP in Production
@clarecorthell
Intuition for Text Analysis
one natural language processing
Worked Example:
http://github.com/clarecorthell/nlp_workshop
@clarecorthell
Representations of Text
for Computation
two natural language processing
Worked Example:
http://github.com/clarecorthell/nlp_workshop
@clarecorthell
Case Study
Mining the Web for Obscure Words
Finding new definitions of words for the Wordnik dictionary
codename: Serapis
three natural language processing
@clarecorthell
Important Note:
We combine Natural Language Processing
and Machine Learning in this framework
@clarecorthell
Wordnik’s Challenge
Find 1 million new words that are already defined online
@clarecorthell
The term “cheeseors” describes flighted
globules of intergalactic cheese, known to be
the scourge of the asteroid belt.
- Erin McKean, Wordnik Founder
what does this word mean?
@clarecorthell
Free-Range Definition or “FRD”
a sentence that contains and contextually defines a word
@clarecorthell
we’re given words without definitions (in Wordnik)
FRD sentences for that word (from the internet)we want
the natural language challenge:
@clarecorthell
This is a Supervised Classification Problem
Given a bunch of sentences containing a given word, we
want to know which ones are an FRD and which are not.
P(FRD | Sentence) = [0.0,1.0]
@clarecorthell
Development: Training
Production: Prediction
training labels
classificationalgorithm
training
input
new
input
Supervised Machine Learning:
Classification
test
input
learn* the difference
between labels
* Learning means finding the functions that define the difference between training labels, based on training input.
Development: Testing
once tests demonstrate the
classification algorithm works well,
push to production
predict
predict
abstract test quality of model
test
predictions
output
classifications
@clarecorthell
To get from word to FRD,
What data do we need?
@clarecorthell
real world text data is messy
user lookups from Wordnik
@clarecorthell
word —> sentence with word
search the web for
@clarecorthell
real world text data is messy
the internet
@clarecorthell
“ephemerable”
@clarecorthell
Both are messy!
We need to do some pre-processing.
Pre-processing is often custom, determined by:
- the domain of the text
- your goals
- dealing with edge cases
@clarecorthell
What does the solution look like so far?
@clarecorthell
Processing Natural
Language for Wordnik
Search Bing
qualify term
raw page
documents
word
cleaned
documents
sentences
containing
word
parse, tokenize,
sanitize text
word
english?
pre-process documents
find body text
replace term
in sentences
store for
learning or prediction
tag parts-of-speech sentences:
part-of-speech
@clarecorthell
All with the intent of creating features!
@clarecorthell
What are Features?
individual measurable properties of the
thing we’re observing.
@clarecorthell
The point of feature development and extraction
is to build derived values from the data
that represent characteristics that identify the data
or differentiate data points from one another
in such a way that the computer can observe that difference
@clarecorthell
Development: Training
Production: Prediction
training labels
classificationalgorithm
training
input
features
new
input
Supervised Machine Learning:
Classification
test
input
predicted
labels
test labels
metrics⊥
feature
extractor
learn*thedifference
betweenlabels
* Learning means finding the functions that define the difference between training labels, based on training input.
data
translations
Development: Testing
once tests demonstrate the
classification algorithm works well,
push to production
feature
extractor
data
translations
features
predict
predict
design
output
classifications
@clarecorthell
How do we construct features
that will differentiate our examples?
We have to get creative, make guesses, and
statistically test them to see what features will work
@clarecorthell
The term “cheeseors” describes flighted
globules of intergalactic cheese, known to be
the scourge of the asteroid belt.
What features do sentences with FRDs have?
@clarecorthell
Possible Feature: Length of Sentence
@clarecorthell
Possible Features: Length, Location, Position in Sentence
Chi-Squared (chi^2) Selection
Chi Squared is a statistical test that describes how
independent two given events are.
In selecting features, the two events are occurrence of the term
and occurrence of the class. If the score is high or significant, it
means that the occurrence is dependent on the class.
@clarecorthell
highest chi^2 feature scoring for tokens
@clarecorthell
highest chi^2 feature scoring for parts of speech
@clarecorthell
Final features for FRDs:
token context around the term (tf-idf)
POS tag context around the term
@clarecorthell
task.search collects URLs and the documents
behind those URLs for a given word
task.detect finds all sentences that are FRDs
within all documents collected for a given word
Final NLP Modules
@clarecorthell
Development: Training
Production: Prediction
training labels
classificationalgorithm
training
input
features
new
input
Supervised Machine Learning:
Classification
test
input
predicted
labels
test labels
metrics⊥
feature
extractor
learn*thedifference
betweenlabels
* Learning means finding the functions that define the difference between training labels, based on training input.
data
translations
Development: Testing
once tests demonstrate the
classification algorithm works well,
push to production
feature
extractor
data
translations
features
predict
predict
design
output
classifications
task.detect
@clarecorthell
Keys to Success in Supervised Learning:
• knowing what an FRD is
• knowing how to encode the FRD for computation
• using the right data sources
• choosing the right features
@clarecorthell
Ready to put it all together?
@clarecorthell
NLP in Production
four natural language processing
Lessons and Patterns from the distributed Wordnik system
codename: Serapis
@clarecorthell
Serapis is the Graeco-Egyptian god of knowledge, education —
and adding a million words into the dictionary.
@clarecorthell
Q: Can’t we just run the same thing on a server?
we want to find definitions for 1 million words
@clarecorthell
Q: Can’t we just run the same thing on a server?
we want to find definitions for 1 million words
which means we make 1 million search requests
@clarecorthell
Q: Can’t we just run the same thing on a server?
we want to find definitions for 1 million words
which means we make 1 million search requests
returning 40 results each for 40 million search results
@clarecorthell
Q: Can’t we just run the same thing on a server?
we want to find definitions for 1 million words
which means we make 1 million search requests
returning 40 results each for 40 million search results
if we follow those links to get the pages, that’s 40 million page requests
@clarecorthell
Q: Can’t we just run the same thing on a server?
we want to find definitions for 1 million words
which means we make 1 million search requests
returning 40 results each for 40 million search results
if we follow those links to get the pages, that’s 40 million page requests
each page request takes on average 2 sec, so 40,000,000 pages * 2 sec
@clarecorthell
Q: Can’t we just run the same thing on a server?
we want to find definitions for 1 million words
which means we make 1 million search requests
returning 40 results each for 40 million search results
if we follow those links to get the pages, that’s 40 million page requests
each page request takes on average 2 sec, so 40,000,000 pages * 2 sec
In series, it would take 5Years to get all the documents we want
@clarecorthell
Q: Can’t we just run the same thing on a server?
Scale and Page Requests
If we want to get search results for 1 million words, it would
take us 5 years to get all the documents if we processed
everything sequentially. We need to parallelize.
A: Nope.
@clarecorthell
Luminant Data &
Distributed Infrastructure for data collection, natural
language processing, and machine learning
resizable compute compute management
service
scalable storage index
@clarecorthell
AWS Lambda is a compute service that runs your
code in response to events and automatically manages
the underlying compute resources for you.
The promise of Lambda is that you don’t have to worry
about infrastructure, rather you set a task and a trigger,
such as a change to a document in an S3 bucket.
Lambda takes over scaling out the resources to
complete all the tasks.
@clarecorthell
1. Start an EC2 instance with the Amazon Linux AMI
2. Build shared libraries from source on EC2
3. Create a virtualenv with python dependencies
4. Write a python handler function to respond to a
given event and do its work (process text, etc)
5. Bundle the virtualenv, your code and the binary libs
into a zip file
6. Publish the zip file to AWS Lambda
Voila.
@clarecorthell
Luminant Data &
Distributed Infrastructure for data collection, natural
language processing, and machine learning
@clarecorthell
It has developed a mechanism to 'dye' very small bitcoin
transactions (called ‘bitcoin dust’) by adding extra data to them so
that they can represent bonds, shares or units of precious metals.
A Few New Words from FRDs
‘oxt weekend,’ in other words, means 'not this coming weekend
but the one after.'
This summer, a new, trendier one, emerged: NATU, for Netflix,
Airbnb, Tesla and Uber.
@clarecorthell
Clare Corthell
clare@luminantdata.com
@clarecorthell
Thank You!

Weitere ähnliche Inhalte

Was ist angesagt?

Text-mining and Automation
Text-mining and AutomationText-mining and Automation
Text-mining and Automationbenosteen
 
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538Krishna Sankar
 
How To Become a Data Scientist in Iran Marketplace
How To Become a Data Scientist in Iran Marketplace How To Become a Data Scientist in Iran Marketplace
How To Become a Data Scientist in Iran Marketplace Mohamadreza Mohtat
 
Machine Learning for Non-technical People
Machine Learning for Non-technical PeopleMachine Learning for Non-technical People
Machine Learning for Non-technical Peopleindico data
 
Artificial Intelligence: Natural Language Processing
Artificial Intelligence: Natural Language ProcessingArtificial Intelligence: Natural Language Processing
Artificial Intelligence: Natural Language ProcessingFrank Cunha
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkKrishna Sankar
 
The Future of Search and AI
The Future of Search and AIThe Future of Search and AI
The Future of Search and AITrey Grainger
 
The Next Generation of AI-powered Search
The Next Generation of AI-powered SearchThe Next Generation of AI-powered Search
The Next Generation of AI-powered SearchTrey Grainger
 
Real World NLP, ML, and Big Data
Real World NLP, ML, and Big DataReal World NLP, ML, and Big Data
Real World NLP, ML, and Big DataDevin Bost
 
Natural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4jNatural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4jWilliam Lyon
 
Teaching AI about human knowledge
Teaching AI about human knowledgeTeaching AI about human knowledge
Teaching AI about human knowledgeInes Montani
 
Implications of GPT-3
Implications of GPT-3Implications of GPT-3
Implications of GPT-3Raven Jiang
 
Natural Language Processing with Neo4j
Natural Language Processing with Neo4jNatural Language Processing with Neo4j
Natural Language Processing with Neo4jKenny Bastani
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDevashish Shanker
 
NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk Vijay Ganti
 
Pandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data SciencePandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data ScienceKrishna Sankar
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsDavid Pittman
 
Life of a data scientist (pub)
Life of a data scientist (pub)Life of a data scientist (pub)
Life of a data scientist (pub)Buhwan Jeong
 
Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical Dhruv Gohil
 
Machine Learning in NLP
Machine Learning in NLPMachine Learning in NLP
Machine Learning in NLPVijay Ganti
 

Was ist angesagt? (20)

Text-mining and Automation
Text-mining and AutomationText-mining and Automation
Text-mining and Automation
 
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
 
How To Become a Data Scientist in Iran Marketplace
How To Become a Data Scientist in Iran Marketplace How To Become a Data Scientist in Iran Marketplace
How To Become a Data Scientist in Iran Marketplace
 
Machine Learning for Non-technical People
Machine Learning for Non-technical PeopleMachine Learning for Non-technical People
Machine Learning for Non-technical People
 
Artificial Intelligence: Natural Language Processing
Artificial Intelligence: Natural Language ProcessingArtificial Intelligence: Natural Language Processing
Artificial Intelligence: Natural Language Processing
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
 
The Future of Search and AI
The Future of Search and AIThe Future of Search and AI
The Future of Search and AI
 
The Next Generation of AI-powered Search
The Next Generation of AI-powered SearchThe Next Generation of AI-powered Search
The Next Generation of AI-powered Search
 
Real World NLP, ML, and Big Data
Real World NLP, ML, and Big DataReal World NLP, ML, and Big Data
Real World NLP, ML, and Big Data
 
Natural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4jNatural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4j
 
Teaching AI about human knowledge
Teaching AI about human knowledgeTeaching AI about human knowledge
Teaching AI about human knowledge
 
Implications of GPT-3
Implications of GPT-3Implications of GPT-3
Implications of GPT-3
 
Natural Language Processing with Neo4j
Natural Language Processing with Neo4jNatural Language Processing with Neo4j
Natural Language Processing with Neo4j
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
 
NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk
 
Pandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data SciencePandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data Science
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data Scientists
 
Life of a data scientist (pub)
Life of a data scientist (pub)Life of a data scientist (pub)
Life of a data scientist (pub)
 
Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical
 
Machine Learning in NLP
Machine Learning in NLPMachine Learning in NLP
Machine Learning in NLP
 

Andere mochten auch

Scaling Harm: Designing Artificial Intelligence for Humans
Scaling Harm: Designing Artificial Intelligence for HumansScaling Harm: Designing Artificial Intelligence for Humans
Scaling Harm: Designing Artificial Intelligence for HumansClare Corthell
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemTrey Grainger
 
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...Trey Grainger
 
Hybrid Intelligence: The New Paradigm
Hybrid Intelligence: The New ParadigmHybrid Intelligence: The New Paradigm
Hybrid Intelligence: The New ParadigmClare Corthell
 
Engineering Ethics: Practicing Fairness
Engineering Ethics: Practicing FairnessEngineering Ethics: Practicing Fairness
Engineering Ethics: Practicing FairnessClare Corthell
 
The Perfect Fit: Scalable Graph for Big Data
The Perfect Fit: Scalable Graph for Big DataThe Perfect Fit: Scalable Graph for Big Data
The Perfect Fit: Scalable Graph for Big DataInside Analysis
 
Inside Wordnik's Architecture
Inside Wordnik's ArchitectureInside Wordnik's Architecture
Inside Wordnik's ArchitectureTony Tam
 
721P1_daic_revision3
721P1_daic_revision3721P1_daic_revision3
721P1_daic_revision3Dai Chen
 
Send Balls Into Orbit with Python3, AsyncIO, WebSockets and React
Send Balls Into Orbit with Python3, AsyncIO, WebSockets and ReactSend Balls Into Orbit with Python3, AsyncIO, WebSockets and React
Send Balls Into Orbit with Python3, AsyncIO, WebSockets and ReactTaras Lyapun
 
DUMP-2012 - Только хардкор! - "Архитектура и запуск облачного сервиса в Amazo...
DUMP-2012 - Только хардкор! - "Архитектура и запуск облачного сервиса в Amazo...DUMP-2012 - Только хардкор! - "Архитектура и запуск облачного сервиса в Amazo...
DUMP-2012 - Только хардкор! - "Архитектура и запуск облачного сервиса в Amazo...it-people
 
Прямая выгода BigData для бизнеса
Прямая выгода BigData для бизнесаПрямая выгода BigData для бизнеса
Прямая выгода BigData для бизнесаAlexey Lustin
 
Behind the Scenes at Coolblue - Feb 2017
Behind the Scenes at Coolblue - Feb 2017Behind the Scenes at Coolblue - Feb 2017
Behind the Scenes at Coolblue - Feb 2017Pat Hermens
 
PyCon Poland 2016: Maintaining a high load Python project: typical mistakes
PyCon Poland 2016: Maintaining a high load Python project: typical mistakesPyCon Poland 2016: Maintaining a high load Python project: typical mistakes
PyCon Poland 2016: Maintaining a high load Python project: typical mistakesViach Kakovskyi
 
Designing Invisible Software at x.ai
Designing Invisible Software at x.aiDesigning Invisible Software at x.ai
Designing Invisible Software at x.aiAlex Poon
 
Автоматизация анализа логов на базе Elasticsearch
Автоматизация анализа логов на базе ElasticsearchАвтоматизация анализа логов на базе Elasticsearch
Автоматизация анализа логов на базе ElasticsearchPositive Hack Days
 
Using Pivotal Cloud Foundry with Google’s BigQuery and Cloud Vision API
Using Pivotal Cloud Foundry with Google’s BigQuery and Cloud Vision APIUsing Pivotal Cloud Foundry with Google’s BigQuery and Cloud Vision API
Using Pivotal Cloud Foundry with Google’s BigQuery and Cloud Vision APIVMware Tanzu
 
Shadow Fight 2: архитектура системы аналитики для миллиарда событий
Shadow Fight 2: архитектура системы аналитики для миллиарда событийShadow Fight 2: архитектура системы аналитики для миллиарда событий
Shadow Fight 2: архитектура системы аналитики для миллиарда событийVyacheslav Nikulin
 
Optimizing Data Architecture for Natural Language Processing
Optimizing Data Architecture for Natural Language ProcessingOptimizing Data Architecture for Natural Language Processing
Optimizing Data Architecture for Natural Language ProcessingAlex Poon
 
2013-02-02 03 Голушко. Полнотекстовый поиск с Elasticsearch
2013-02-02 03 Голушко. Полнотекстовый поиск с Elasticsearch2013-02-02 03 Голушко. Полнотекстовый поиск с Elasticsearch
2013-02-02 03 Голушко. Полнотекстовый поиск с ElasticsearchОмские ИТ-субботники
 

Andere mochten auch (20)

Scaling Harm: Designing Artificial Intelligence for Humans
Scaling Harm: Designing Artificial Intelligence for HumansScaling Harm: Designing Artificial Intelligence for Humans
Scaling Harm: Designing Artificial Intelligence for Humans
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data Ecosystem
 
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
 
Hybrid Intelligence: The New Paradigm
Hybrid Intelligence: The New ParadigmHybrid Intelligence: The New Paradigm
Hybrid Intelligence: The New Paradigm
 
Engineering Ethics: Practicing Fairness
Engineering Ethics: Practicing FairnessEngineering Ethics: Practicing Fairness
Engineering Ethics: Practicing Fairness
 
The Perfect Fit: Scalable Graph for Big Data
The Perfect Fit: Scalable Graph for Big DataThe Perfect Fit: Scalable Graph for Big Data
The Perfect Fit: Scalable Graph for Big Data
 
Inside Wordnik's Architecture
Inside Wordnik's ArchitectureInside Wordnik's Architecture
Inside Wordnik's Architecture
 
721P1_daic_revision3
721P1_daic_revision3721P1_daic_revision3
721P1_daic_revision3
 
Send Balls Into Orbit with Python3, AsyncIO, WebSockets and React
Send Balls Into Orbit with Python3, AsyncIO, WebSockets and ReactSend Balls Into Orbit with Python3, AsyncIO, WebSockets and React
Send Balls Into Orbit with Python3, AsyncIO, WebSockets and React
 
DUMP-2012 - Только хардкор! - "Архитектура и запуск облачного сервиса в Amazo...
DUMP-2012 - Только хардкор! - "Архитектура и запуск облачного сервиса в Amazo...DUMP-2012 - Только хардкор! - "Архитектура и запуск облачного сервиса в Amazo...
DUMP-2012 - Только хардкор! - "Архитектура и запуск облачного сервиса в Amazo...
 
Прямая выгода BigData для бизнеса
Прямая выгода BigData для бизнесаПрямая выгода BigData для бизнеса
Прямая выгода BigData для бизнеса
 
Behind the Scenes at Coolblue - Feb 2017
Behind the Scenes at Coolblue - Feb 2017Behind the Scenes at Coolblue - Feb 2017
Behind the Scenes at Coolblue - Feb 2017
 
PyCon Poland 2016: Maintaining a high load Python project: typical mistakes
PyCon Poland 2016: Maintaining a high load Python project: typical mistakesPyCon Poland 2016: Maintaining a high load Python project: typical mistakes
PyCon Poland 2016: Maintaining a high load Python project: typical mistakes
 
Designing Invisible Software at x.ai
Designing Invisible Software at x.aiDesigning Invisible Software at x.ai
Designing Invisible Software at x.ai
 
Pycon UA 2016
Pycon UA 2016Pycon UA 2016
Pycon UA 2016
 
Автоматизация анализа логов на базе Elasticsearch
Автоматизация анализа логов на базе ElasticsearchАвтоматизация анализа логов на базе Elasticsearch
Автоматизация анализа логов на базе Elasticsearch
 
Using Pivotal Cloud Foundry with Google’s BigQuery and Cloud Vision API
Using Pivotal Cloud Foundry with Google’s BigQuery and Cloud Vision APIUsing Pivotal Cloud Foundry with Google’s BigQuery and Cloud Vision API
Using Pivotal Cloud Foundry with Google’s BigQuery and Cloud Vision API
 
Shadow Fight 2: архитектура системы аналитики для миллиарда событий
Shadow Fight 2: архитектура системы аналитики для миллиарда событийShadow Fight 2: архитектура системы аналитики для миллиарда событий
Shadow Fight 2: архитектура системы аналитики для миллиарда событий
 
Optimizing Data Architecture for Natural Language Processing
Optimizing Data Architecture for Natural Language ProcessingOptimizing Data Architecture for Natural Language Processing
Optimizing Data Architecture for Natural Language Processing
 
2013-02-02 03 Голушко. Полнотекстовый поиск с Elasticsearch
2013-02-02 03 Голушко. Полнотекстовый поиск с Elasticsearch2013-02-02 03 Голушко. Полнотекстовый поиск с Elasticsearch
2013-02-02 03 Голушко. Полнотекстовый поиск с Elasticsearch
 

Ähnlich wie Distributed Natural Language Processing Systems in Python

The Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge GraphThe Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge GraphTrey Grainger
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrLucidworks
 
How to Build a Semantic Search System
How to Build a Semantic Search SystemHow to Build a Semantic Search System
How to Build a Semantic Search SystemTrey Grainger
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsDerek Kane
 
Building yourself with Python - Learn the Basics!!
Building yourself with Python - Learn the Basics!!Building yourself with Python - Learn the Basics!!
Building yourself with Python - Learn the Basics!!FRANKLINODURO
 
Searching for Meaning
Searching for MeaningSearching for Meaning
Searching for MeaningTrey Grainger
 
Querying data on the Web – client or server?
Querying data on the Web – client or server?Querying data on the Web – client or server?
Querying data on the Web – client or server?Ruben Verborgh
 
Patterns of Semantic Integration
Patterns of Semantic IntegrationPatterns of Semantic Integration
Patterns of Semantic IntegrationOptum
 
CSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialCSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialLeeFeigenbaum
 
Finding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic WebFinding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic Webebiquity
 
Introduction to Multimodal LLMs with LLaVA
Introduction to Multimodal LLMs with LLaVAIntroduction to Multimodal LLMs with LLaVA
Introduction to Multimodal LLMs with LLaVARobert McDermott
 
Introduction to Multimodal LLMs with LLaVA
Introduction to Multimodal LLMs with LLaVAIntroduction to Multimodal LLMs with LLaVA
Introduction to Multimodal LLMs with LLaVARobert McDermott
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesMax Irwin
 
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudLeveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudDatabricks
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerOpenSource Connections
 
The Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge GraphThe Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge GraphTrey Grainger
 
Lambda The Extreme: Test-Driving a Functional Language
Lambda The Extreme: Test-Driving a Functional LanguageLambda The Extreme: Test-Driving a Functional Language
Lambda The Extreme: Test-Driving a Functional LanguageAccenture | SolutionsIQ
 
Declarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemTDeclarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemTLaura Chiticariu
 
Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)Trey Grainger
 
Computer Programming for Lawyers
Computer Programming for LawyersComputer Programming for Lawyers
Computer Programming for LawyersNehal Madhani
 

Ähnlich wie Distributed Natural Language Processing Systems in Python (20)

The Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge GraphThe Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge Graph
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with Solr
 
How to Build a Semantic Search System
How to Build a Semantic Search SystemHow to Build a Semantic Search System
How to Build a Semantic Search System
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
 
Building yourself with Python - Learn the Basics!!
Building yourself with Python - Learn the Basics!!Building yourself with Python - Learn the Basics!!
Building yourself with Python - Learn the Basics!!
 
Searching for Meaning
Searching for MeaningSearching for Meaning
Searching for Meaning
 
Querying data on the Web – client or server?
Querying data on the Web – client or server?Querying data on the Web – client or server?
Querying data on the Web – client or server?
 
Patterns of Semantic Integration
Patterns of Semantic IntegrationPatterns of Semantic Integration
Patterns of Semantic Integration
 
CSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialCSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web Tutorial
 
Finding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic WebFinding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic Web
 
Introduction to Multimodal LLMs with LLaVA
Introduction to Multimodal LLMs with LLaVAIntroduction to Multimodal LLMs with LLaVA
Introduction to Multimodal LLMs with LLaVA
 
Introduction to Multimodal LLMs with LLaVA
Introduction to Multimodal LLMs with LLaVAIntroduction to Multimodal LLMs with LLaVA
Introduction to Multimodal LLMs with LLaVA
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
 
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudLeveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
 
The Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge GraphThe Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge Graph
 
Lambda The Extreme: Test-Driving a Functional Language
Lambda The Extreme: Test-Driving a Functional LanguageLambda The Extreme: Test-Driving a Functional Language
Lambda The Extreme: Test-Driving a Functional Language
 
Declarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemTDeclarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemT
 
Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)
 
Computer Programming for Lawyers
Computer Programming for LawyersComputer Programming for Lawyers
Computer Programming for Lawyers
 

Kürzlich hochgeladen

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 

Kürzlich hochgeladen (20)

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

Distributed Natural Language Processing Systems in Python

  • 1. Distributed Natural Language Processing Systems in Python Clare Corthell,Founder of Luminant Data @clarecorthell thinkingmachin.es/events
  • 2. Luminant Data is a Machine Intelligence Consultancy building more intelligent businesses with data science strategy and artificial intelligence technology luminantdata.com based in San Francisco Clare Corthell, Founder
  • 3. datasciencemasters.org The Open Source Data Science Masters The most popular curriculum for learning data science Author: Clare Corthell
  • 4. development environment python 2.7 pip, pandas, iPython, TextBlob, scikit-learn @clarecorthell
  • 5. hold on to your seats, we’re doing NLP end-to-end @clarecorthell
  • 6. What is natural language processing 4? @clarecorthell
  • 7. how does this artificial assistant know what you’re talking about?@clarecorthell
  • 10. Natural Language Processing is the domain concerned with translating human language into something a computer can process @clarecorthell
  • 11. teaching computers to better understand human language today — @clarecorthell
  • 12. Natural Language Processing Code 1.Intuition forTextAnalysis 2.Representations ofText for Computation Case Study 3.Mining theWeb for ObscureWords 4.NLP in Production @clarecorthell
  • 13. Intuition for Text Analysis one natural language processing Worked Example: http://github.com/clarecorthell/nlp_workshop @clarecorthell
  • 14. Representations of Text for Computation two natural language processing Worked Example: http://github.com/clarecorthell/nlp_workshop @clarecorthell
  • 15. Case Study Mining the Web for Obscure Words Finding new definitions of words for the Wordnik dictionary codename: Serapis three natural language processing @clarecorthell
  • 16. Important Note: We combine Natural Language Processing and Machine Learning in this framework @clarecorthell
  • 17. Wordnik’s Challenge Find 1 million new words that are already defined online @clarecorthell
  • 18. The term “cheeseors” describes flighted globules of intergalactic cheese, known to be the scourge of the asteroid belt. - Erin McKean, Wordnik Founder what does this word mean? @clarecorthell
  • 19. Free-Range Definition or “FRD” a sentence that contains and contextually defines a word @clarecorthell
  • 20. we’re given words without definitions (in Wordnik) FRD sentences for that word (from the internet)we want the natural language challenge: @clarecorthell
  • 21. This is a Supervised Classification Problem Given a bunch of sentences containing a given word, we want to know which ones are an FRD and which are not. P(FRD | Sentence) = [0.0,1.0] @clarecorthell
  • 22. Development: Training Production: Prediction training labels classificationalgorithm training input new input Supervised Machine Learning: Classification test input learn* the difference between labels * Learning means finding the functions that define the difference between training labels, based on training input. Development: Testing once tests demonstrate the classification algorithm works well, push to production predict predict abstract test quality of model test predictions output classifications @clarecorthell
  • 23. To get from word to FRD, What data do we need? @clarecorthell
  • 24. real world text data is messy user lookups from Wordnik @clarecorthell
  • 25. word —> sentence with word search the web for @clarecorthell
  • 26. real world text data is messy the internet @clarecorthell
  • 28. Both are messy! We need to do some pre-processing. Pre-processing is often custom, determined by: - the domain of the text - your goals - dealing with edge cases @clarecorthell
  • 29. What does the solution look like so far? @clarecorthell
  • 30. Processing Natural Language for Wordnik Search Bing qualify term raw page documents word cleaned documents sentences containing word parse, tokenize, sanitize text word english? pre-process documents find body text replace term in sentences store for learning or prediction tag parts-of-speech sentences: part-of-speech @clarecorthell
  • 31. All with the intent of creating features! @clarecorthell
  • 32. What are Features? individual measurable properties of the thing we’re observing. @clarecorthell
  • 33. The point of feature development and extraction is to build derived values from the data that represent characteristics that identify the data or differentiate data points from one another in such a way that the computer can observe that difference @clarecorthell
  • 34. Development: Training Production: Prediction training labels classificationalgorithm training input features new input Supervised Machine Learning: Classification test input predicted labels test labels metrics⊥ feature extractor learn*thedifference betweenlabels * Learning means finding the functions that define the difference between training labels, based on training input. data translations Development: Testing once tests demonstrate the classification algorithm works well, push to production feature extractor data translations features predict predict design output classifications @clarecorthell
  • 35. How do we construct features that will differentiate our examples? We have to get creative, make guesses, and statistically test them to see what features will work @clarecorthell
  • 36. The term “cheeseors” describes flighted globules of intergalactic cheese, known to be the scourge of the asteroid belt. What features do sentences with FRDs have? @clarecorthell
  • 37. Possible Feature: Length of Sentence @clarecorthell
  • 38. Possible Features: Length, Location, Position in Sentence Chi-Squared (chi^2) Selection Chi Squared is a statistical test that describes how independent two given events are. In selecting features, the two events are occurrence of the term and occurrence of the class. If the score is high or significant, it means that the occurrence is dependent on the class. @clarecorthell
  • 39. highest chi^2 feature scoring for tokens @clarecorthell
  • 40. highest chi^2 feature scoring for parts of speech @clarecorthell
  • 41. Final features for FRDs: token context around the term (tf-idf) POS tag context around the term @clarecorthell
  • 42. task.search collects URLs and the documents behind those URLs for a given word task.detect finds all sentences that are FRDs within all documents collected for a given word Final NLP Modules @clarecorthell
  • 43. Development: Training Production: Prediction training labels classificationalgorithm training input features new input Supervised Machine Learning: Classification test input predicted labels test labels metrics⊥ feature extractor learn*thedifference betweenlabels * Learning means finding the functions that define the difference between training labels, based on training input. data translations Development: Testing once tests demonstrate the classification algorithm works well, push to production feature extractor data translations features predict predict design output classifications task.detect @clarecorthell
  • 44. Keys to Success in Supervised Learning: • knowing what an FRD is • knowing how to encode the FRD for computation • using the right data sources • choosing the right features @clarecorthell
  • 45. Ready to put it all together? @clarecorthell
  • 46. NLP in Production four natural language processing Lessons and Patterns from the distributed Wordnik system codename: Serapis @clarecorthell
  • 47. Serapis is the Graeco-Egyptian god of knowledge, education — and adding a million words into the dictionary. @clarecorthell
  • 48. Q: Can’t we just run the same thing on a server? we want to find definitions for 1 million words @clarecorthell
  • 49. Q: Can’t we just run the same thing on a server? we want to find definitions for 1 million words which means we make 1 million search requests @clarecorthell
  • 50. Q: Can’t we just run the same thing on a server? we want to find definitions for 1 million words which means we make 1 million search requests returning 40 results each for 40 million search results @clarecorthell
  • 51. Q: Can’t we just run the same thing on a server? we want to find definitions for 1 million words which means we make 1 million search requests returning 40 results each for 40 million search results if we follow those links to get the pages, that’s 40 million page requests @clarecorthell
  • 52. Q: Can’t we just run the same thing on a server? we want to find definitions for 1 million words which means we make 1 million search requests returning 40 results each for 40 million search results if we follow those links to get the pages, that’s 40 million page requests each page request takes on average 2 sec, so 40,000,000 pages * 2 sec @clarecorthell
  • 53. Q: Can’t we just run the same thing on a server? we want to find definitions for 1 million words which means we make 1 million search requests returning 40 results each for 40 million search results if we follow those links to get the pages, that’s 40 million page requests each page request takes on average 2 sec, so 40,000,000 pages * 2 sec In series, it would take 5Years to get all the documents we want @clarecorthell
  • 54. Q: Can’t we just run the same thing on a server? Scale and Page Requests If we want to get search results for 1 million words, it would take us 5 years to get all the documents if we processed everything sequentially. We need to parallelize. A: Nope. @clarecorthell
  • 55. Luminant Data & Distributed Infrastructure for data collection, natural language processing, and machine learning resizable compute compute management service scalable storage index @clarecorthell
  • 56. AWS Lambda is a compute service that runs your code in response to events and automatically manages the underlying compute resources for you. The promise of Lambda is that you don’t have to worry about infrastructure, rather you set a task and a trigger, such as a change to a document in an S3 bucket. Lambda takes over scaling out the resources to complete all the tasks. @clarecorthell
  • 57. 1. Start an EC2 instance with the Amazon Linux AMI 2. Build shared libraries from source on EC2 3. Create a virtualenv with python dependencies 4. Write a python handler function to respond to a given event and do its work (process text, etc) 5. Bundle the virtualenv, your code and the binary libs into a zip file 6. Publish the zip file to AWS Lambda Voila. @clarecorthell
  • 58. Luminant Data & Distributed Infrastructure for data collection, natural language processing, and machine learning @clarecorthell
  • 59. It has developed a mechanism to 'dye' very small bitcoin transactions (called ‘bitcoin dust’) by adding extra data to them so that they can represent bonds, shares or units of precious metals. A Few New Words from FRDs ‘oxt weekend,’ in other words, means 'not this coming weekend but the one after.' This summer, a new, trendier one, emerged: NATU, for Netflix, Airbnb, Tesla and Uber. @clarecorthell