Evaluating Methods for the Identification of Cancer in Free-Text Pathology Reports Using Alternative Machine Learning and Data Preprocessing Approaches
Suranga Nath Kasthurirathne
What does that even mean ?
Our problem
• Cancer case reporting to public health
registries is often:
– Delayed
– Incomplete
Our emphasis
• Use pathology reports
• Automate it (It actually works !)
Our solution
• Speed
• Accuracy
• Applicability to other surveillance activities
• Computational efficiency
Issues
• Lots of data
• Lots of FREE-TEXT data
• Not enough time
• Not enough resources
Clarifications
When I say “We”:
• “We” in terms of decision making and
consultation usually means Dr. Grannis
• “We” in terms of implementation and code
mongering usually means Suranga
Our basic approach
Solution/s
What improvements are we trying out?
• Alternative data input formats
• Candidate decision models
• Decision model combinations
• HOW to look vs. WHAT to look for
Manual review
• Functions as our source of truth
– What ?
– Why ?
Manually reviewed 1495 reports
Identified 371 (24.8%) positive cancer cases
Machine learning process
• Identification of keywords
– What ARE keywords ?
Metastasis, tumor, malignant, neoplasm, stage,
carcinoma and ca
• Identification of negation context
• Use of alternative data input formats
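The keyword and negation steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the study: the keyword list comes from the slide, but the negation cues, the five-token look-back window, and the `count_keywords` helper are all assumptions made for the example.

```python
import re

# Keyword list as given on the slide.
KEYWORDS = ["metastasis", "tumor", "malignant", "neoplasm", "stage", "carcinoma", "ca"]

# Hypothetical negation cues (assumption); a real system would use a fuller trigger list.
NEGATION_CUES = ["no", "not", "without", "negative for", "free of"]


def count_keywords(report, window=5):
    """Return {keyword: (affirmed_count, negated_count)} for one report.

    A keyword counts as negated when any cue appears within the `window`
    tokens before it. The window ignores sentence boundaries, which is a
    deliberate simplification.
    """
    tokens = re.findall(r"[a-z]+", report.lower())
    counts = {kw: [0, 0] for kw in KEYWORDS}
    for i, tok in enumerate(tokens):
        if tok not in counts:
            continue
        preceding = " ".join(tokens[max(0, i - window):i])
        negated = any(
            re.search(r"\b" + re.escape(cue) + r"\b", preceding)
            for cue in NEGATION_CUES
        )
        counts[tok][1 if negated else 0] += 1
    return {kw: tuple(c) for kw, c in counts.items()}


example = "Invasive ductal carcinoma. No evidence of metastasis. Margins negative for tumor."
print(count_keywords(example))
```

In the example report, "carcinoma" is counted as affirmed while "metastasis" and "tumor" are counted as negated.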
What were the different data input
formats used ?
• Raw data input
• Four state data input
What and Why ?
• Raw
• Four state
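The slides name the two input formats without defining them, so the sketch below shows only one plausible reading, and every detail in it is an assumption: "raw" is taken to mean per-keyword counts in affirmed and negated context, and "four state" is taken to mean collapsing those counts into one of four categories per keyword. The function names and state labels are hypothetical.

```python
def raw_features(counts):
    """Assumed 'raw' format: keep the actual counts for each keyword."""
    features = {}
    for kw, (affirmed, negated) in counts.items():
        features[kw + "_affirmed"] = affirmed
        features[kw + "_negated"] = negated
    return features


def four_state_features(counts):
    """Assumed 'four state' format: one categorical value per keyword."""
    states = {}
    for kw, (affirmed, negated) in counts.items():
        if affirmed and negated:
            states[kw] = "both"
        elif affirmed:
            states[kw] = "affirmed_only"
        elif negated:
            states[kw] = "negated_only"
        else:
            states[kw] = "absent"
    return states


# Hypothetical per-report counts: {keyword: (affirmed_count, negated_count)}
sample = {"carcinoma": (3, 0), "metastasis": (0, 1), "tumor": (2, 1), "stage": (0, 0)}
print(raw_features(sample))
print(four_state_features(sample))
```

Under this reading, the four-state form discards how often a keyword occurs and keeps only the context it occurred in.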
So basically
Training / Testing
• What ?
• Why cross validation ?
• Alternative decision models
– So many options !
– Classification vs. Clustering analysis
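As a rough illustration of cross validation: every report is used for testing exactly once and for training in all other rounds, so no single train/test split decides the result. The sketch below is plain Python rather than Weka, and the choice of 10 folds and the `k_fold_indices` helper are assumptions; the slides do not state the number of folds.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 1495 matches the number of manually reviewed reports.
for round_number, (train, test) in enumerate(k_fold_indices(1495, k=10), start=1):
    print(round_number, len(train), len(test))
```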
To preserve my sanity, and because
we’re not stupid…
• We used Weka (Waikato Environment for
Knowledge Analysis)
– is a collection of machine learning algorithms
for data mining tasks
– is Open Source !
Decision models used
• Logistic regression
• Naïve Bayes
• Support vector machine
• K-nearest neighbor
• Random forest
• J48 decision tree
(Thanks Jamie !!!)
Results
• How do we measure our results ?
– Precision
• What % of positive predictions were correct?
– Recall
• What % of positive cases were caught?
– Accuracy
• What % of predictions were correct?
Precision vs. recall: the fine balance
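The three measures can be computed from the manual review labels and the model predictions as in the sketch below. The `evaluate` helper and the toy labels are illustrative assumptions, and returning 0.0 when a denominator is zero is a convention chosen for the example.

```python
def evaluate(actual, predicted):
    """Return (precision, recall, accuracy) for binary labels (1 = cancer)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # cancer, flagged
    fp = sum(1 for a, p in pairs if not a and p)      # not cancer, flagged
    fn = sum(1 for a, p in pairs if a and not p)      # cancer, missed
    tn = sum(1 for a, p in pairs if not a and not p)  # not cancer, not flagged
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


# Toy example with made-up labels.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
print(evaluate(actual, predicted))
```

Accuracy alone can mislead here: with 371 positives among 1495 reports, a model that always predicts "no cancer" would score roughly 75% accuracy while catching no cases at all, which is why precision and recall are reported alongside it.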
Results contd.…
• RF and NB showed statistically significantly
lower values for precision
• SVM exhibited statistically significantly
lower results for recall
• SVM and NB produced statistically
significantly lower results for accuracy
Overall performance by
preprocessed input type
• Raw count input performed significantly better
than four-state input
Overall performance by decision
model
• Ensemble approach is significantly
better than individual algorithms
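The slides do not say how the individual models were combined. One common way to build an ensemble is a simple majority vote across models, sketched below as an assumption rather than as the method used in the study; the `majority_vote` helper and the prediction lists are made up for illustration.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by majority vote.

    predictions_per_model: list of equal-length label lists, one per model.
    Using an odd number of models avoids ties for binary labels.
    """
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions from three models on four reports.
model_a = [1, 0, 1, 0]
model_b = [1, 1, 0, 0]
model_c = [0, 1, 1, 0]
print(majority_vote([model_a, model_b, model_c]))
```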
Improvements
Keywords ? sure, I have
a list…
Better identification of keywords
Shaun
Problems with NegEx…
Results
• The funder is happy… we think
• We wrote an abstract !
• Feature selection approaches for keyword
identification as an independent study
rotation
Our thanks to…
• Dr. Shaun Grannis (RI)
• Dr. Brian Dixon (RI)
• Dr. Judy Wawira (IUPUI)
• Eric Durbin (UKC)
Questions ?

Editor's notes

  1. Explain title
  2. Describe actual problem, speak of registries, physician resources etc.
  3. What are we focusing on, i.e. what are we concerned about ?
  4. Who actually did what
  5. General flowchart showing what happens
  6. How we’re trying to solve things