1. Evaluating Methods for the Identification of Cancer in Free-Text Pathology Reports Using Alternative Machine Learning and Data Preprocessing Approaches
Suranga Nath Kasthurirathne
3. Our problem
• Cancer case reporting to public health
registries is often:
– Delayed
– Incomplete
4. Our emphasis
• Use pathology reports
• Automate it (it actually works!)
Our solution
• Speed
• Accuracy
• Applicability to other surveillance activities
• Computational efficiency
5. Issues
• Lots of data
• Lots of FREE-TEXT data
• Not enough time
• Not enough resources
6. Clarifications
When I say “We”:
• “We” in terms of decision making and
consultation usually means Dr. Grannis
• “We” in terms of implementation and code
mongering usually means Suranga
8. Solution/s
What improvements are we trying out?
• Alternative data input formats
• Candidate decision models
• Decision model combinations
• HOW to look vs. WHAT to look for
9. Manual review
• Functions as our source of truth
– What?
– Why?
Manually reviewed 1495 reports
Identified 371 (24.8%) positive cancer cases
10. Machine learning process
• Identification of keywords
– What ARE keywords?
Metastasis, tumor, malignant, neoplasm, stage,
carcinoma and ca
• Identification of negation context
• Use of alternate data input formats
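The three steps above can be sketched in a few lines. A minimal, stdlib-only illustration of keyword spotting with a naive negation check — the slide lists only the keywords, so the negation cues, the 30-character look-back window, and the function name are all assumptions, not the study's implementation:

```python
import re

# Keywords listed on the slide; "ca" is matched as a whole word so it
# does not fire inside words like "candidate".
KEYWORDS = ["metastasis", "tumor", "malignant", "neoplasm",
            "stage", "carcinoma", "ca"]
# Hypothetical negation cues -- the deck does not enumerate the real ones.
NEGATION_CUES = ["no", "not", "without", "negative for", "free of"]

def find_keywords(report):
    """Return (keyword, negated?) pairs found in a free-text report."""
    text = report.lower()
    hits = []
    for kw in KEYWORDS:
        for m in re.finditer(r"\b%s\b" % re.escape(kw), text):
            # Look for a negation cue shortly before the keyword,
            # without crossing a sentence boundary.
            window = text[max(0, m.start() - 30):m.start()]
            window = re.split(r"[.;:]", window)[-1]
            negated = any(re.search(r"\b%s\b" % re.escape(c), window)
                          for c in NEGATION_CUES)
            hits.append((kw, negated))
    return hits
```

For example, `find_keywords("Findings negative for carcinoma; small benign tumor noted.")` flags the carcinoma mention as negated but the tumor mention as affirmed.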
11. What were the different data input formats used?
• Raw data input
• Four state data input
What and why?
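The deck does not define the four states, so the following is only a plausible sketch — it assumes the states distinguish an absent keyword, an affirmed mention, a negated mention, and mixed mentions, and every name, cue list, and window size here is illustrative:

```python
import re

def four_state_features(report, keywords, negation_cues):
    """Encode each keyword as one of four assumed states:
    0 = absent, 1 = affirmed mention, 2 = negated mention,
    3 = both affirmed and negated mentions in the same report."""
    text = report.lower()
    states = {}
    for kw in keywords:
        affirmed = negated = False
        for m in re.finditer(r"\b%s\b" % re.escape(kw), text):
            # Naive negation check: a cue in the preceding few words,
            # not crossing a sentence boundary.
            window = text[max(0, m.start() - 30):m.start()]
            window = re.split(r"[.;:]", window)[-1]
            if any(re.search(r"\b%s\b" % re.escape(c), window)
                   for c in negation_cues):
                negated = True
            else:
                affirmed = True
        states[kw] = 3 if (affirmed and negated) else \
                     2 if negated else 1 if affirmed else 0
    return states
```

A richer encoding like this gives the classifier more signal than a raw present/absent flag, which is one motivation for trying alternative input formats.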
14. Training / Testing
• What?
• Why cross-validation?
• Alternative decision models
– So many options!
– Classification vs. clustering analysis
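Why cross-validation: with only ~1,500 labeled reports, a single train/test split wastes data and yields a noisy estimate, while k-fold rotation lets every report serve as test data exactly once. A stdlib-only sketch of the splitting step (the fold count and seed below are arbitrary, not the study's settings):

```python
import random

def k_fold_splits(n, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold
    cross-validation over n examples. Every example lands in
    exactly one test fold, so the whole labeled set is used for
    both training and evaluation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # fixed seed: reproducible folds
    folds = [idx[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in range(k) if f != i for j in folds[f]]
        yield train, test
```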
15. To preserve my sanity, and because
we’re not stupid…
• We used Weka (Waikato Environment for
Knowledge Analysis)
– a collection of machine learning algorithms
for data mining tasks
– open source!
16. Decision models used
• Logistic regression
• Naïve Bayes
• Support vector machine
• K-nearest neighbor
• Random forest
• J48 decision tree
(Thanks, Jamie!)
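The study ran these models through Weka's implementations; for flavor, here is a stdlib-only sketch of the simplest one on the list, k-nearest neighbor (the distance metric and k=3 default are illustrative choices, not Weka's configuration):

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify feature vector x by majority vote among its k nearest
    training examples, using squared Euclidean distance."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

With training points clustered near (0, 0) labeled 0 and near (5, 5) labeled 1, a query close to either cluster is assigned that cluster's label by majority vote.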
18. Results
• How do we measure our results?
– Precision
• What % of positive predictions were correct?
– Recall
• What % of positive cases were caught?
– Accuracy
• What % of predictions were correct?
Precision vs. recall: the fine balance
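All three measures reduce to counts from the binary confusion matrix; a small helper makes the definitions concrete (the counts in the test below are made-up numbers for illustration, not the study's results):

```python
def metrics(tp, fp, tn, fn):
    """Precision, recall, and accuracy from binary confusion counts."""
    precision = tp / (tp + fp)                  # % of positive predictions correct
    recall = tp / (tp + fn)                     # % of true positive cases caught
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # % of all predictions correct
    return precision, recall, accuracy
```

Raising a classifier's decision threshold typically converts false positives into false negatives, trading recall for precision — the "fine balance" above.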
19. Results contd.
• RF and NB showed statistically
significantly lower precision
• SVM exhibited statistically significantly
lower recall
• SVM and NB produced statistically
significantly lower accuracy
25. Results
• The funder is happy… we think
• We wrote an abstract!
• Feature selection approaches for keyword
identification as an independent study
rotation
26. Our thanks to…
• Dr. Shaun Grannis (RI)
• Dr. Brian Dixon (RI)
• Dr. Judy Wawira (IUPUI)
• Eric Durbin (UKC)