NLP is a key component in many data science systems that must understand or reason about text. This hands-on tutorial uses the open-source Spark NLP library to explore advanced NLP in Python.
Advanced Natural Language Processing with Spark NLP
2. Advanced Natural Language Processing
with Spark NLP
Alex Thomas, Principal Data Scientist at WiseCube
David Talby, CTO at John Snow Labs
3. Agenda
Introducing Spark NLP
Accuracy, scalability, and speed benchmarks
Out-of-the-box functionality
Getting Things Done
End-to-end NLP tasks in 3 lines of code
Key concepts and a backstage tour
Notebooks!
Using pre-trained pipelines & models
Named entity recognition
Document classification
5. SPARK NLP IN THE ENTERPRISE
O’REILLY AI ADOPTION IN THE ENTERPRISE SURVEY OF 1,300 PRACTITIONERS, FEB 2019
7. ACCURACY
• “State of the art” means the best peer-reviewed academic results
• Public benchmarks: comparing production-grade NLP libraries
• Public benchmarks of pre-trained models: nlp.johnsnowlabs.com
“Spark NLP 2.4 sets new accuracy records for common tasks including NER, OCR & Matching”
New: Redesigned NER-DL and BERT-large
New: Spark OCR image filters & scalable pipelines
New: Hierarchical clinical entity resolution
“Spark NLP 2.5 delivers state-of-the-art accuracy for spell checking and sentiment analysis”
New: ALBERT & XLNet embeddings
New: Contextual spell checker
New: DL-based sentiment analysis
8. SCALABILITY
• Zero code changes to scale a pipeline
to any Spark cluster
• Only natively distributed
open-source NLP library
• Spark provides execution planning,
caching, serialization, shuffling
• Caveats
– Speedup depends heavily on what
you actually do
– Not all algorithms scale well
– Spark configuration matters
9. SPEED: GET THE MOST FROM MODERN HARDWARE
• Optimized builds of Spark NLP
for both Intel and Nvidia
• Benchmark done on AWS:
Train a Named Entity Recognizer in
French
• Achieving an F1 score of 89% requires
at least 80 epochs with a batch size
of 512
• Intel outperformed Nvidia: Cascade
Lake was 19% faster & 46% cheaper
than Tesla P-100
10. Production Grade + Active Community
In production at multiple Fortune 500 companies
26 new releases in 2018, 30 in 2019
Active Slack community
Permissive open source license: Apache 2.0
14. SENTIMENT ANALYSIS
import sparknlp
sparknlp.start()
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('analyze-sentiment', 'en')
result = pipeline.annotate('Harry Potter is a great movie')
print(result['sentiment'])  # will print ['positive']
15. NAMED ENTITY RECOGNITION
pipeline = PretrainedPipeline('recognize_entities_bert', 'en')
result = pipeline.annotate('Harry and Ron met in Hogsmeade')
print(result['ner'])
# prints ['I-PER', 'O', 'I-PER', 'O', 'O', 'I-LOC']
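The tags follow an IOB-style scheme, and since annotate() returns plain Python lists, post-processing is ordinary Python. A small hypothetical helper (not part of Spark NLP) that groups consecutive same-type tags back into entities:

```python
tokens = ['Harry', 'and', 'Ron', 'met', 'in', 'Hogsmeade']
tags = ['I-PER', 'O', 'I-PER', 'O', 'O', 'I-LOC']

def extract_entities(tokens, tags):
    """Group consecutive tokens that share a non-O tag type."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        tag_type = tag.split('-')[-1] if tag != 'O' else None
        if tag_type and tag_type == current_type:
            current.append(token)  # continue the current entity span
        else:
            if current:
                entities.append((' '.join(current), current_type))
            current = [token] if tag_type else []
            current_type = tag_type
    if current:
        entities.append((' '.join(current), current_type))
    return entities

print(extract_entities(tokens, tags))
# [('Harry', 'PER'), ('Ron', 'PER'), ('Hogsmeade', 'LOC')]
```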
16. SPELL CHECKING & CORRECTION
Now in Scala:
val pipeline = PretrainedPipeline("spell_check_ml", "en")
val result = pipeline.annotate("Harry Potter is a graet muvie")
println(result("spell"))
/* will print Seq[String](…, "is", "a", "great", "movie") */
17. UNDER THE HOOD
1. sparknlp.start() starts a new Spark session if there isn’t one, and returns it.
2. PretrainedPipeline() loads the English version of the explain_document_dl pipeline, the pre-trained models, and the embeddings it depends on.
3. These are stored and cached locally.
4. TensorFlow is initialized within the same JVM process that runs Spark. The pre-trained embeddings and deep-learning models (like NER) are loaded; models are automatically distributed and shared if running on a cluster.
5. The annotate() call runs an NLP inference pipeline which activates each stage’s algorithm (tokenization, POS tagging, etc.).
6. The NER stage is run on TensorFlow, applying a neural network with bi-LSTM layers for tokens and a CNN for characters.
7. Embeddings are used to convert contextual tokens into vectors during the NER inference process.
8. The result object is a plain old local Python dictionary.
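For illustration, the dictionary returned by annotate() for an explain_document_dl-style pipeline has roughly this shape; the exact keys depend on the pipeline’s stages, and these POS and NER values are illustrative, not actual model output:

```python
# Hypothetical annotate() output: one key per pipeline output column.
result = {
    'token': ['Harry', 'and', 'Ron', 'met', 'in', 'Hogsmeade'],
    'pos':   ['NNP', 'CC', 'NNP', 'VBD', 'IN', 'NNP'],
    'ner':   ['I-PER', 'O', 'I-PER', 'O', 'O', 'I-LOC'],
}

# Because it is a plain local dict, ordinary Python works on it:
for token, pos, tag in zip(result['token'], result['pos'], result['ner']):
    print(f'{token}\t{pos}\t{tag}')
```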
18. KEY CONCEPT #1: PIPELINE
A list of text processing steps.
Each step has input and output columns.
text → DocumentAssembler → document → SentenceDetector → sentence → Tokenizer → token → SentimentAnalyzer → sentiment
19. KEY CONCEPT #2: ANNOTATOR
sentiment_detector = SentimentDetector() \
    .setInputCols(["sentence"]) \
    .setOutputCol("sentiment_score") \
    .setDictionary(resource_path + "sent.txt")
An object encapsulating one text processing step.
20. KEY CONCEPT #3: RESOURCE
• Trained ML models
• Trained DL networks
• Dictionaries
• Embeddings
• Rules
• Pretrained pipelines
An external file that an annotator needs.
Resources can be shared, cached, and locally stored.
21. KEY CONCEPT #4: PRETRAINED PIPELINE
A pre-built pipeline, with all the annotators and resources it needs.
22. PUTTING IT ALL TOGETHER: TRAINING A NER WITH BERT
Initialization
Training data
Resources
Annotator
Pipeline
Run Training
24. Cleaning, Splitting, and Finding Text
+ Understanding Grammar
Run “Spark NLP Basics” notebook
Open on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb
25. Using Pre-trained Pipelines
+ Named Entity Recognition
Run “Entity Recognizer with Deep Learning” notebook
Open on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/colab/4-%20Entity%20Recognizer%20DL.ipynb
26. Training your own NER model
Run “NER BERT Training” notebook
Open on Google Colab: https://colab.research.google.com/drive/1A1ovV74nOG-MEpVQnmageeU-ksRLSmXZ
Walkthrough in blog post: https://www.johnsnowlabs.com/named-entity-recognition-ner-with-bert-in-spark-nlp/
27. Document Classification
+ Universal Sentence Embeddings
Run “Text Classification with ClassifierDL” notebook
Open on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb
28. WHAT ELSE IS AVAILABLE?
• Spark NLP for Healthcare: 50+ models for clinical entity recognition, linking
to medical terminologies, assertion status detection, and de-identification
• Spark OCR: 20 annotators for image enhancement, layout, and smart editing
29. LEARN MORE: TECHNICAL CASE STUDIES
Improving Patient Flow Forecasting
Automated clinical coding & chart
reviews
Knowledge Extraction from
Pathology Reports
High-accuracy fact extraction
from long financial documents
Improving Mental Health for
HIV-Positive Adolescents
Accelerating Clinical Trial Recruiting
30. NEXT STEPS
1. READ THE DOCS & JOIN SLACK
HTTPS://NLP.JOHNSNOWLABS.COM
2. STAR & FORK THE REPO
GITHUB.COM/JOHNSNOWLABS/SPARK-NLP
3. QUESTIONS? GET IN TOUCH