The speaker will review case studies from real-world projects that built AI systems using Natural Language Processing (NLP) in healthcare. These case studies cover projects that deployed automated patient risk prediction, automated diagnosis, clinical guidelines, and revenue cycle optimization.
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthcare AI Systems
2. Spark NLP for Healthcare
Lessons Learned Building Real-World
Healthcare AI Systems
Veysel Kocaman
Sr. Data Scientist
3. Agenda
▪ Introducing Spark NLP
▪ Problem areas in healthcare
analytics
▪ Solving healthcare related NLP
problems
▪ Case studies
4. Introducing Spark NLP
● Natural Language Toolkit (NLTK): The complete toolkit
for all NLP techniques.
● TextBlob: Easy to use NLP tools API, built on top of NLTK
and Pattern.
● SpaCy: Industrial strength NLP with Python and Cython.
● Gensim: Topic Modelling for Humans
● Stanford Core NLP: NLP services and packages by
Stanford NLP Group.
● Fasttext: NLP library by Facebook’s AI Research (FAIR)
lab
● ...
● Spark NLP is an open-source natural language
processing library, built on top of Apache Spark and
Spark ML. (initial release: Oct 2017)
○ A single unified solution for all your NLP needs
○ Takes advantage of transfer learning and implements the
latest and greatest SOTA algorithms and models in NLP
research
○ Fills a gap: no other NLP library is fully supported by
Spark
○ Delivers a mission-critical, enterprise-grade NLP
library (used by multiple Fortune 500 companies)
○ Full-time development team (26 new releases in 2018,
30 new releases in 2019)
https://medium.com/spark-nlp/introduction-to-spark-nlp-foundations-and-basic-components-part-i-c83b7629ed59
7. Introducing Spark NLP
● Python, Java, Scala and R
● “State of the art” means the best performing academic
peer-reviewed results
● Built on the Spark ML APIs
● Apache 2.0 Licensed
● Active development & support
● Zero code changes to scale a pipeline to any Spark
cluster
● The only open-source NLP library that is natively
distributed
● Spark provides execution planning, caching,
serialization, shuffling
14. Spark is like a locomotive racing a
bicycle. The bike will win if the load
is light: it is quicker to accelerate
and more agile. With a heavy load,
the locomotive might take a while
to get up to speed, but it’s going
to be faster in the end.
LightPipelines are Spark ML pipelines converted into a single-machine,
multithreaded task, becoming more than 10x faster for smaller
amounts of data (small is relative, but 50k sentences is roughly a
good maximum).
Spark NLP Light Pipelines
Faster inference in runtime from Spark
NLP pipelines
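The locomotive analogy and the 50k-sentence rule of thumb can be pictured with a toy cost model: a fixed startup overhead plus a constant cost per sentence. The sketch below is illustrative only. The `Engine` class, the `break_even` helper, and every overhead and throughput figure are hypothetical assumptions, not measured Spark NLP numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    """Toy cost model: fixed startup overhead plus a constant per-sentence cost."""
    name: str
    overhead_s: float       # hypothetical fixed cost (scheduling, serialization, ...)
    per_sentence_s: float   # hypothetical marginal cost per sentence

    def runtime(self, n_sentences: int) -> float:
        return self.overhead_s + self.per_sentence_s * n_sentences


def break_even(light: Engine, heavy: Engine) -> float:
    """Number of sentences at which both engines take the same time."""
    return (heavy.overhead_s - light.overhead_s) / (
        light.per_sentence_s - heavy.per_sentence_s
    )


# Hypothetical figures chosen only to illustrate the shape of the trade-off.
bicycle = Engine("single-machine", overhead_s=0.1, per_sentence_s=0.002)
locomotive = Engine("cluster", overhead_s=60.0, per_sentence_s=0.0005)

crossover = break_even(bicycle, locomotive)
```

With these made-up numbers the crossover lands near 40k sentences: below it the low-overhead engine finishes first, above it the higher-throughput engine wins. Real break-even points depend on the pipeline, the hardware, and the cluster.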
16. Spark NLP in Healthcare
Raw & unstructured data / Clean & structured data / Healthcare data
● Less than 50% of the structured data and less than 1% of the unstructured data is being leveraged for decision
making in companies (HBR). This is even worse in healthcare.
● NLP is ultra domain specific, so train your own models.
21. Spark NLP in Healthcare
NLP Library / Feature: State of the Art (SOTA) Research
● Named Entity Recognition
○ “Entity Recognition from Clinical Texts via Recurrent Neural Network”. Liu et al., BMC Medical Informatics & Decision Making, July 2017.
● Word Embeddings
○ “How to Train Good Word Embeddings for Biomedical NLP”. Chiu et al., In Proceedings of BioNLP’16, August 2016.
○ “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. Devlin et al. (Google Research), October 2018.
● Assertion Status Detection
○ “Improving Classification of Medical Assertions in Clinical Notes”. Kim et al., In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011.
○ “Neural Networks For Negation Scope Detection”. Fancellu et al., In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016.
● Entity Resolution
○ “CNN-based ranking for biomedical entity normalization”. Li et al., BMC Bioinformatics, October 2017.
25. Clinical Assertion Model
● Present: Prescribing sick days due to diagnosis of influenza.
● Conditional: 41 yo man with CRFs of DM Type II, high cholesterol, smoking history, family hx, HTN p/w episodes of atypical CP x 1 week, with rest and exertion.
● Absent: Jane’s RIDT came back clean.
● Hypothetical: Jane is at risk for flu if she’s not vaccinated.
● Present: There was a dense hemianopsia on the left side.
“Neural Networks For Negation Scope Detection”.
Fancellu et al., In Proceedings of the 54th Annual
Meeting of the Association for Computational
Linguistics, 2016.
Scope of negation: given a negated instance, identify which tokens are affected by the negation.
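To make the scope-of-negation task concrete, here is a minimal rule-based sketch. It is a simple cue-and-terminator heuristic, not the neural model from the cited paper and not Spark NLP's assertion model. The cue list, the terminator list, and the `negation_scope` function are hypothetical and deliberately tiny.

```python
import re

# Hypothetical, deliberately tiny cue and terminator lists for illustration.
NEGATION_CUES = {"no", "not", "denies", "without"}
SCOPE_TERMINATORS = {"but", "however", "although"}


def negation_scope(sentence: str) -> list[tuple[str, bool]]:
    """Return (token, in_scope) pairs for a single sentence.

    Tokens after a negation cue are marked in scope until a terminator
    word or a punctuation mark ends the scope.
    """
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    in_scope = False
    result = []
    for token in tokens:
        lower = token.lower()
        if lower in NEGATION_CUES:
            in_scope = True
            result.append((token, False))  # the cue itself is not in scope
        elif lower in SCOPE_TERMINATORS or not token[0].isalnum():
            in_scope = False
            result.append((token, False))
        else:
            result.append((token, in_scope))
    return result


example = "Patient denies chest pain but reports nausea."
negated = [tok for tok, flag in negation_scope(example) if flag]
```

Here `negated` holds only the tokens between the cue "denies" and the terminator "but". A heuristic like this breaks on many clinical phrasings, which is one reason learned models are used for the task.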
27. Clinical Deidentification Model
* Identifies pieces of content that may contain personal information about patients and removes them by replacing them with semantic tags.
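The idea of replacing personal information with semantic tags can be sketched with a few regular expressions. This is a toy stand-in, not the Spark NLP deidentification model, which is a trained model and not a pattern list. The patterns, tag names, and sample note below are hypothetical and cover only a sliver of real protected health information.

```python
import re

# Hypothetical patterns for illustration; real PHI detection needs far more coverage.
PHI_PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")),
    ("PHONE", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
    ("NAME", re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+\b")),
]


def deidentify(text: str) -> str:
    """Replace each matched span with a semantic tag such as <DATE>."""
    for tag, pattern in PHI_PATTERNS:
        text = pattern.sub(f"<{tag}>", text)
    return text


note = "Dr. Smith saw the patient on 03/14/2019; callback number 555-867-5309."
clean = deidentify(note)
```

Keeping the tag type (`<DATE>`, `<NAME>`) in place of the removed span preserves sentence structure, so downstream NLP steps can still run on the de-identified text.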
31. Customer Case Studies
1. How SelectData uses AI to better
understand home health patients
2. How Roche automated knowledge
extraction from pathology and radiology
reports
3. Improving patient flow forecasting at
Kaiser Permanente
4. How Deep6 accelerates clinical trial
recruitment
32. SelectData
What is Home Health, and what problems are coming?
Silver Tsunami
● By 2022 more than 25 percent of US workers will be 55 or older
● Nearly 10,000 baby boomers reach retirement age each day
● Home Health is expected to grow by 6.7% next year
Expert Reviewer
● The Bureau of Labor Statistics projects that the need for medical coders will
increase by 15% by 2027
● Healthcare Data is used in decision-making
Aging Baby Boomers
● By 2039 the rate of Medicare spending and net interest on national debt will
exceed total projected revenues
● Payment reform focused on reduction in price
33. SelectData
Problems vs Solutions
TL;DR => we have more people, fewer qualified workers, and our clients are
receiving less money for the care of each patient.
34. SelectData
● OCR is difficult: different layouts, different
scales, noise, rotation.
● High number of records and pages.
● Need for cluster processing.
● Cluster processing is difficult.
36. SelectData
● We create a pipeline composed of annotators.
● The pipeline runs in a cluster.
● We can process many documents in parallel and scale out.
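The pattern described above, a pipeline of annotators applied to many documents in parallel, can be sketched in plain Python. This is an assumption-laden stand-in: the annotators are hypothetical toy functions, not Spark NLP annotators, and a thread pool on one machine takes the place of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# An "annotator" here is any function that reads and extends a document dict.
# These stages are hypothetical stand-ins, not Spark NLP annotators.
Annotator = Callable[[dict], dict]


def sentence_splitter(doc: dict) -> dict:
    sentences = [s.strip() for s in doc["text"].split(".") if s.strip()]
    return {**doc, "sentences": sentences}


def tokenizer(doc: dict) -> dict:
    return {**doc, "tokens": [s.split() for s in doc["sentences"]]}


def run_pipeline(stages: list[Annotator], doc: dict) -> dict:
    """Apply each annotator in order; later stages see earlier annotations."""
    for stage in stages:
        doc = stage(doc)
    return doc


def process_all(stages: list[Annotator], texts: list[str], workers: int = 4) -> list[dict]:
    """Documents are independent, so they can be processed in parallel."""
    docs = [{"text": t} for t in texts]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda d: run_pipeline(stages, d), docs))


pipeline = [sentence_splitter, tokenizer]
results = process_all(pipeline, ["Page one. Scanned text.", "Page two."])
```

Because each document flows through the stages independently, adding workers (or, in a real deployment, cluster nodes) scales throughput without changing the pipeline definition.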
43. Case 2: Roche
Manual curation is extremely time consuming, expensive,
and prone to errors
Manually Curated TCGA Report
Sample Results from Curation
44. Case 2: Roche
The NAVIFY team identified two significant needs
1. Natural Language Processing (NLP):
● High accuracy
● Specialized for medical data
● Minimize time to train new models
● Extensible for new content types
2. Optical Character Recognition (OCR):
● High accuracy
● Retain document structure
(e.g. tables, lists, paragraphs, ...)
Requirements for both:
● Scalable (support 10 million pathology reports per
year)
● Compliant with privacy laws
● Integrates easily with AWS services
● Low cost
Action Plan:
● Initial goal of speeding up review of pathology
reports
● Will then automate extraction of high confidence
entities and relationships
● Will keep increasing automation of NLP over time
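The scalability requirement on this slide (10 million pathology reports per year) translates into a modest average rate, which a quick back-of-the-envelope calculation makes visible. The sketch assumes a perfectly even load around the clock, and the 30-second processing time per report is a hypothetical figure, not one from the talk.

```python
import math

REPORTS_PER_YEAR = 10_000_000          # requirement stated on the slide
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignoring leap years


def required_rate(reports: int, seconds: int) -> float:
    """Average reports per second needed to keep up with the yearly volume."""
    return reports / seconds


def workers_needed(rate: float, seconds_per_report: float) -> int:
    """Parallel workers needed if one worker spends `seconds_per_report` on each report."""
    return math.ceil(rate * seconds_per_report)


rate = required_rate(REPORTS_PER_YEAR, SECONDS_PER_YEAR)
# Hypothetical: 30 seconds of OCR plus NLP per report on one worker.
workers = workers_needed(rate, seconds_per_report=30.0)
```

The average works out to roughly one report every three seconds. Real traffic is bursty, so capacity planning would size for peak load and not for this yearly average.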
46. Case 2: Roche
Lessons Learned
● Extracting text from domain specific PDFs/images is unpredictable
● Quantitative evaluation of OCR is challenging
● Bridging the gap between domain knowledge & NLP requires consensus
● Evidence does not always match with standard terminologies
● Building NLP pipelines that are generalizable:
○ Static components like tokenization, sentence detection, POS tagging and chunking can be
reused
○ Data sources (hospitals) differ, so the NLP approach needs to be plug and play
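One way to read the last lesson is as a split between shared stages that never change and a small source-specific adapter that is swapped per hospital. The sketch below illustrates that split under stated assumptions: the hospital formats, reader functions, and `READERS` registry are hypothetical and do not describe the Roche implementation.

```python
from typing import Callable


# Shared "static" stages, written once and reused for every data source.
def detect_sentences(text: str) -> list[str]:
    return [s.strip() for s in text.split(".") if s.strip()]


def tokenize(sentences: list[str]) -> list[list[str]]:
    return [s.split() for s in sentences]


# Source-specific readers: hypothetical hospital formats for illustration.
def read_hospital_a(raw: str) -> str:
    return raw.replace("|", " ")   # pipe-delimited export


def read_hospital_b(raw: str) -> str:
    return " ".join(raw.split())   # collapse stray whitespace and newlines


READERS: dict[str, Callable[[str], str]] = {
    "hospital_a": read_hospital_a,
    "hospital_b": read_hospital_b,
}


def process(source: str, raw: str) -> list[list[str]]:
    """Plug in the reader for the source, then run the shared stages."""
    text = READERS[source](raw)
    return tokenize(detect_sentences(text))


tokens_a = process("hospital_a", "Tumor|size|2cm. Margins|clear.")
tokens_b = process("hospital_b", "Tumor  size\n2cm.")
```

Onboarding a new data source then means registering one more reader, while sentence detection and tokenization stay untouched.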
47. Case 3: Kaiser Permanente
Improving Patient Flow Forecasting
Objectives
Optimize the patient flow models & provide insights,
for real-time decision-making and for strategic planning,
by predicting:
● Bed demand
● 'Safe' staffing levels
● Hospital gridlock
50. Case 4: Deep6
Feature engineering with Spark NLP to accelerate clinical trial recruitment
(reducing the time that it takes to find a patient for trials)
● Your treatments are > 15 years old
● Cutting edge treatments only
available in clinical trials
● Faster cycles make lifesaving
treatments available sooner