SlideShare a Scribd company logo
1 of 32
Download to read offline
Advanced Natural Language Processing
with Spark NLP
Alex Thomas, Principal Data Scientist at WiseCube
David Talby, CTO at John Snow Labs
Agenda
Introducing Spark NLP
Accuracy, scalability, and speed benchmarks
Out-of-the-box functionality
Getting Things Done
End-to-end NLP tasks in 3 lines of code
Key concepts and a backstage tour
Notebooks!
Using pre-trained pipelines & models
Named entity recognition
Document classification
INTRODUCING SPARK NLP
STATE OF THE ART NLP FOR PYTHON, JAVA & SCALA
1. ACCURACY
2. SCALABILITY
3. SPEED
SPARK NLP IN THE ENTERPRISE
O’REILLY AI ADOPTION IN THE ENTERPRISE SURVEY OF 1,300 PRACTITIONERS, FEB 2019
• ”State of the art” means the best peer-reviewed academic results
• Public benchmarks: Comparing production-grade NLP libraries
• Public benchmarks of pre-trained models: nlp.johnsnowlabs.com
“Spark NLP 2.4 sets new accuracy records for common tasks including NER, OCR & Matching”
New: Redesigned NER-DL and BERT-large
New: Spark OCR image filters & scalable pipelines
New: Hierarchical clinical entity resolution
“Spark NLP 2.5 delivers state-of-the-art accuracy for spell checking and sentiment analysis”
New: ALBERT & XLNet embeddings
New: Contextual spell checker
New: DL-based sentiment analysis
ACCURACY
SCALABILITY
• Zero code changes to scale a pipeline
to any Spark cluster
• Only natively distributed
open-source NLP library
• Spark provides execution planning,
caching, serialization, shuffling
• Caveats
– Speedup depends heavily on what
you actually do
– Not all algorithms scale well
– Spark configuration matters
SPEED: GET THE MOST FROM MODERN HARDWARE
• Optimized builds of Spark NLP
for both Intel and Nvidia
• Benchmark done on AWS:
Train a Named Entity Recognizer in
French
• Achieving F1-score of 89% requires
at least 80 Epochs with batch size of
512
• Intel outperformed Nvidia: Cascade
Lake was 19% faster & 46% cheaper
than Tesla P-100
Production Grade + Active Community
In production in multiple Fortune 500’s
26 new releases in 2018, 30 in 2019
Active Slack community
Permissive open source license: Apache 2.0
SPARK NLP
out-of-the-box
functionality
OFFICIALLY SUPPORTED RUNTIMES
Getting Things Done
SENTIMENT ANALYSIS
import sparknlp
sparknlp.start()
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('analyze-sentiment', 'en')
result = pipeline.annotate('Harry Potter is a great movie’)
print(result['sentiment’]) ## will print ['positive']
NAMED ENTITY RECOGNITION
pipeline = PretrainedPipeline('recognize_entities_bert', 'en')
result = pipeline.annotate('Harry and Ron met in Hogsmeade')
print(result['ner'])
# prints ['I-PER', 'O', 'I-PER', 'O', 'O', 'I-LOC')
SPELL CHECKING & CORRECTION
Now in Scala:
val pipeline = PretrainedPipeline("spell_check_ml", "en")
val result = pipeline.annotate("Harry Potter is a graet muvie")
println(result("spell"))
/* will print Seq[String](…, "is", "a", "great", "movie") */
UNDER THE HOOD
1.sparknlp.start() starts a new Spark session if there isn’t one, and returns it.
2.PretrainedPipeline() loads the English version of the explain_document_dl pipeline,
the pre-trained models, and the embeddings it depends on.
3. These are stored and cached locally.
4. TensorFlow is initialized, within the same JVM process that runs Spark.
The pre-trained embeddings and deep-learning models (like NER) are loaded. Models are
automatically distributed and shared if running on a cluster.
5. The annotate() call runs an NLP inference pipeline which activates each stage’s
algorithm (tokenization, POS, etc.).
6. The NER stage is run on TensorFlow – applying a neural network with bi-LSTM layers for
tokens and a CNN for characters.
7. Embeddings are used to convert contextual tokens into vectors during the NER inference
process.
8. The result object is a plain old local Python dictionary.
KEY CONCEPT #1: PIPELINE
A list of text processing steps.
Each step has input and output columns.
Document
Assembler
Sentence
Detector
Tokenizer
Sentiment
Analyzer
text document sentence token sentiment
KEY CONCEPT #2: ANNOTATOR
sentiment_detector = SentimentDetector() 
.setInputCols(["sentence”]) 
.setOutputCol("sentiment_score") 
.setDictionary(resource_path+"sent.txt")
An object encapsulating one text processing step.
KEY CONCEPT #3: RESOURCE
• Trained ML models
• Trained DL networks
• Dictionaries
• Embeddings
• Rules
• Pretrained pipelines
An external file that an annotator needs.
Resources can be shared, cached, and locally stored.
KEY CONCEPT #4: PRETRAINED PIPELINE
A pre-built pipeline, with all the annotators and resources it needs.
PUTTING IT ALL TOGETHER: TRAINING A NER WITH BERT
Initialization
Training data
Resources
Annotator
Pipeline
Run Training
Notebooks!
Cleaning, Splitting, and Finding Text
+ Understanding Grammar
Run “Spark NLP Basics” notebook
Oprn on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/
blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb
Using Pre-trained Pipelines
+ Named Entity Recognition
Run “Entity Recognizer with Deep Learning” notebook
Open on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/
blob/master/tutorials/colab/4-%20Entity%20Recognizer%20DL.ipynb
Training your own NER model
Run “NER BERT Training” notebook
Open on Google Colab: https://colab.research.google.com/drive/1A1ovV74nOG-MEpVQnmageeU-ksRLSmXZ
Walkthrough in blog post: https://www.johnsnowlabs.com/named-entity-recognition-ner-with-bert-in-spark-nlp/
Document Classification
+ Universal Sentence Embeddings
Run “Text Classification with ClassifierDL” notebook
Open on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/
blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb
WHAT ELSE IS AVAILABLE?
• Spark NLP for Healthcare: 50+ models for clinical entity recognition, linking
to medical terminologies, assertion status detection, and de-identification
• Spark OCR: 20 annotators for image enhancement, layout, and smart editing
LEARN MORE: TECHNICAL CASE STUDIES
Improving Patient Flow Forecasting
Automated clinical coding & chart
reviews
Knowledge Extraction from
Pathology Reports
High-accuracy fact extraction
from long financial documents
Improving Mental Health for
HIV-Positive Adolescents
Accelerating Clinical Trial Recruiting
NEXT STEPS
1. READ THE DOCS & JOIN SLACK
HTTPS://NLP.JOHNSNOWLABS.COM
2. STAR & FORK THE REPO
GITHUB.COM/JOHNSNOWLABS/SPARK-NLP
3. QUESTIONS? GET IT TOUCH
Thank you!
alex@wisecube.com
david@johnsnowlabs.com
Advanced Natural Language Processing with Spark NLP

More Related Content

What's hot

Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonIgor Anishchenko
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Julian Hyde
 
Tfa collector docv121210
Tfa collector docv121210Tfa collector docv121210
Tfa collector docv121210Hanh Nguyen Duy
 
Online Tweet Sentiment Analysis with Apache Spark
Online Tweet Sentiment Analysis with Apache SparkOnline Tweet Sentiment Analysis with Apache Spark
Online Tweet Sentiment Analysis with Apache SparkDavide Nardone
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overviewJulian Hyde
 
An introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTAn introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTSuman Debnath
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowJulien Le Dem
 
Vectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at FacebookVectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at FacebookDatabricks
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Databricks
 
Spark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean WamplerSpark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean WamplerSpark Summit
 
Financial Question Answering with BERT Language Models
Financial Question Answering with BERT Language ModelsFinancial Question Answering with BERT Language Models
Financial Question Answering with BERT Language ModelsBithiah Yuan
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDatabricks
 
Trino at linkedIn - 2021
Trino at linkedIn - 2021Trino at linkedIn - 2021
Trino at linkedIn - 2021Akshay Rai
 

What's hot (20)

Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased Comparison
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
 
Tfa collector docv121210
Tfa collector docv121210Tfa collector docv121210
Tfa collector docv121210
 
Online Tweet Sentiment Analysis with Apache Spark
Online Tweet Sentiment Analysis with Apache SparkOnline Tweet Sentiment Analysis with Apache Spark
Online Tweet Sentiment Analysis with Apache Spark
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
 
An introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTAn introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERT
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
Vectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at FacebookVectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at Facebook
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Spark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean WamplerSpark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean Wampler
 
Financial Question Answering with BERT Language Models
Financial Question Answering with BERT Language ModelsFinancial Question Answering with BERT Language Models
Financial Question Answering with BERT Language Models
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Trino at linkedIn - 2021
Trino at linkedIn - 2021Trino at linkedIn - 2021
Trino at linkedIn - 2021
 

Similar to Advanced Natural Language Processing with Spark NLP

Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
Apache Spark NLP: Extending Spark ML to Deliver Fast, Scalable & Unified Nat...
 Apache Spark NLP: Extending Spark ML to Deliver Fast, Scalable & Unified Nat... Apache Spark NLP: Extending Spark ML to Deliver Fast, Scalable & Unified Nat...
Apache Spark NLP: Extending Spark ML to Deliver Fast, Scalable & Unified Nat...Databricks
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsMiklos Christine
 
Open Source Lambda Architecture for deep learning
Open Source Lambda Architecture for deep learningOpen Source Lambda Architecture for deep learning
Open Source Lambda Architecture for deep learningPatrick Nicolas
 
PPT5: Neuron Introduction
PPT5: Neuron IntroductionPPT5: Neuron Introduction
PPT5: Neuron Introductionakira-ai
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsTakuya UESHIN
 
Simplify Governance of Streaming Data
Simplify Governance of Streaming Data Simplify Governance of Streaming Data
Simplify Governance of Streaming Data confluent
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Michael Rys
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsXiao Li
 
Build a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimizationBuild a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimizationCraig Chao
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesDatabricks
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Sparkfelixcss
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityDatabricks
 
IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for SparkMark Kerzner
 
What's New in Spark 2?
What's New in Spark 2?What's New in Spark 2?
What's New in Spark 2?Eyal Ben Ivri
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosEuangelos Linardos
 
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Helena Edelson
 

Similar to Advanced Natural Language Processing with Spark NLP (20)

Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Apache Spark NLP: Extending Spark ML to Deliver Fast, Scalable & Unified Nat...
 Apache Spark NLP: Extending Spark ML to Deliver Fast, Scalable & Unified Nat... Apache Spark NLP: Extending Spark ML to Deliver Fast, Scalable & Unified Nat...
Apache Spark NLP: Extending Spark ML to Deliver Fast, Scalable & Unified Nat...
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
Open Source Lambda Architecture for deep learning
Open Source Lambda Architecture for deep learningOpen Source Lambda Architecture for deep learning
Open Source Lambda Architecture for deep learning
 
PPT5: Neuron Introduction
PPT5: Neuron IntroductionPPT5: Neuron Introduction
PPT5: Neuron Introduction
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
Simplify Governance of Streaming Data
Simplify Governance of Streaming Data Simplify Governance of Streaming Data
Simplify Governance of Streaming Data
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
Build a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimizationBuild a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimization
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
 
PySaprk
PySaprkPySaprk
PySaprk
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 
IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for Spark
 
What's New in Spark 2?
What's New in Spark 2?What's New in Spark 2?
What's New in Spark 2?
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
 
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfSubhamKumar3239
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 

Recently uploaded (20)

Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdf
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 

Advanced Natural Language Processing with Spark NLP

  • 1.
  • 2. Advanced Natural Language Processing with Spark NLP Alex Thomas, Principal Data Scientist at WiseCube David Talby, CTO at John Snow Labs
  • 3. Agenda Introducing Spark NLP Accuracy, scalability, and speed benchmarks Out-of-the-box functionality Getting Things Done End-to-end NLP tasks in 3 lines of code Key concepts and a backstage tour Notebooks! Using pre-trained pipelines & models Named entity recognition Document classification
  • 4. INTRODUCING SPARK NLP STATE OF THE ART NLP FOR PYTHON, JAVA & SCALA 1. ACCURACY 2. SCALABILITY 3. SPEED
  • 5. SPARK NLP IN THE ENTERPRISE O’REILLY AI ADOPTION IN THE ENTERPRISE SURVEY OF 1,300 PRACTITIONERS, FEB 2019
  • 6.
  • 7. • ”State of the art” means the best peer-reviewed academic results • Public benchmarks: Comparing production-grade NLP libraries • Public benchmarks of pre-trained models: nlp.johnsnowlabs.com “Spark NLP 2.4 sets new accuracy records for common tasks including NER, OCR & Matching” New: Redesigned NER-DL and BERT-large New: Spark OCR image filters & scalable pipelines New: Hierarchical clinical entity resolution “Spark NLP 2.5 delivers state-of-the-art accuracy for spell checking and sentiment analysis” New: ALBERT & XLNet embeddings New: Contextual spell checker New: DL-based sentiment analysis ACCURACY
  • 8. SCALABILITY • Zero code changes to scale a pipeline to any Spark cluster • Only natively distributed open-source NLP library • Spark provides execution planning, caching, serialization, shuffling • Caveats – Speedup depends heavily on what you actually do – Not all algorithms scale well – Spark configuration matters
  • 9. SPEED: GET THE MOST FROM MODERN HARDWARE • Optimized builds of Spark NLP for both Intel and Nvidia • Benchmark done on AWS: Train a Named Entity Recognizer in French • Achieving F1-score of 89% requires at least 80 Epochs with batch size of 512 • Intel outperformed Nvidia: Cascade Lake was 19% faster & 46% cheaper than Tesla P-100
  • 10. Production Grade + Active Community In production in multiple Fortune 500’s 26 new releases in 2018, 30 in 2019 Active Slack community Permissive open source license: Apache 2.0
  • 14. SENTIMENT ANALYSIS import sparknlp sparknlp.start() from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('analyze-sentiment', 'en') result = pipeline.annotate('Harry Potter is a great movie’) print(result['sentiment’]) ## will print ['positive']
  • 15. NAMED ENTITY RECOGNITION pipeline = PretrainedPipeline('recognize_entities_bert', 'en') result = pipeline.annotate('Harry and Ron met in Hogsmeade') print(result['ner']) # prints ['I-PER', 'O', 'I-PER', 'O', 'O', 'I-LOC')
  • 16. SPELL CHECKING & CORRECTION Now in Scala: val pipeline = PretrainedPipeline("spell_check_ml", "en") val result = pipeline.annotate("Harry Potter is a graet muvie") println(result("spell")) /* will print Seq[String](…, "is", "a", "great", "movie") */
  • 17. UNDER THE HOOD 1.sparknlp.start() starts a new Spark session if there isn’t one, and returns it. 2.PretrainedPipeline() loads the English version of the explain_document_dl pipeline, the pre-trained models, and the embeddings it depends on. 3. These are stored and cached locally. 4. TensorFlow is initialized, within the same JVM process that runs Spark. The pre-trained embeddings and deep-learning models (like NER) are loaded. Models are automatically distributed and shared if running on a cluster. 5. The annotate() call runs an NLP inference pipeline which activates each stage’s algorithm (tokenization, POS, etc.). 6. The NER stage is run on TensorFlow – applying a neural network with bi-LSTM layers for tokens and a CNN for characters. 7. Embeddings are used to convert contextual tokens into vectors during the NER inference process. 8. The result object is a plain old local Python dictionary.
  • 18. KEY CONCEPT #1: PIPELINE A list of text processing steps. Each step has input and output columns. Document Assembler Sentence Detector Tokenizer Sentiment Analyzer text document sentence token sentiment
  • 19. KEY CONCEPT #2: ANNOTATOR sentiment_detector = SentimentDetector() .setInputCols(["sentence”]) .setOutputCol("sentiment_score") .setDictionary(resource_path+"sent.txt") An object encapsulating one text processing step.
  • 20. KEY CONCEPT #3: RESOURCE • Trained ML models • Trained DL networks • Dictionaries • Embeddings • Rules • Pretrained pipelines An external file that an annotator needs. Resources can be shared, cached, and locally stored.
  • 21. KEY CONCEPT #4: PRETRAINED PIPELINE A pre-built pipeline, with all the annotators and resources it needs.
  • 22. PUTTING IT ALL TOGETHER: TRAINING A NER WITH BERT Initialization Training data Resources Annotator Pipeline Run Training
  • 24. Cleaning, Splitting, and Finding Text + Understanding Grammar Run “Spark NLP Basics” notebook Oprn on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/ blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb
  • 25. Using Pre-trained Pipelines + Named Entity Recognition Run “Entity Recognizer with Deep Learning” notebook Open on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/ blob/master/tutorials/colab/4-%20Entity%20Recognizer%20DL.ipynb
  • 26. Training your own NER model Run “NER BERT Training” notebook Open on Google Colab: https://colab.research.google.com/drive/1A1ovV74nOG-MEpVQnmageeU-ksRLSmXZ Walkthrough in blog post: https://www.johnsnowlabs.com/named-entity-recognition-ner-with-bert-in-spark-nlp/
  • 27. Document Classification + Universal Sentence Embeddings Run “Text Classification with ClassifierDL” notebook Open on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/ blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb
  • 28. WHAT ELSE IS AVAILABLE? • Spark NLP for Healthcare: 50+ models for clinical entity recognition, linking to medical terminologies, assertion status detection, and de-identification • Spark OCR: 20 annotators for image enhancement, layout, and smart editing
  • 29. LEARN MORE: TECHNICAL CASE STUDIES Improving Patient Flow Forecasting Automated clinical coding & chart reviews Knowledge Extraction from Pathology Reports High-accuracy fact extraction from long financial documents Improving Mental Health for HIV-Positive Adolescents Accelerating Clinical Trial Recruiting
  • 30. NEXT STEPS 1. READ THE DOCS & JOIN SLACK HTTPS://NLP.JOHNSNOWLABS.COM 2. STAR & FORK THE REPO GITHUB.COM/JOHNSNOWLABS/SPARK-NLP 3. QUESTIONS? GET IT TOUCH