This document discusses data mining of radiology reports to structure unstructured text for further analysis. Experts annotated sentences from over 500,000 de-identified radiology reports, containing over 36 million words, mapping each sentence to a category called a proposition. So far over 427,000 unique sentences have been annotated, representing 60% of all sentences in the corpus. The structured data is stored in a relational database and can be analyzed to find frequent findings and compare normal vs. abnormal results. Similar prior works are discussed, but the large scale of this dataset and its expert validation set it apart.
1. Data Mining in Radiology Reports. Saeed Mehrabi, Spring 2010, INFO-I535. Dr. Patrick W. Jamieson, Dr. Josette Jones
2. Outline: Introduction to data and text mining; Our data set; Structuring free text; Results; Similar works; Discussion
3. What is Data Mining? Data mining is the extraction of useful patterns from data sources such as databases, texts, and the web. There is a big gap between stored data and knowledge, and the transition won’t occur automatically. Many interesting things you want to find cannot be found using database queries, for example “Find me people likely to buy my products” or “Who is likely to respond to my promotion?”
4. Why data mining now? The data is abundant. The data is being warehoused. The computing power is affordable. The competitive pressure is strong. Data mining tools have become available.
5. Text Mining: Text mining applies and adapts data mining techniques to the text domain. Structured vs. free text: structured text can be stored in a relational database. Providing the means to represent the data available in text in a structured format will make information exchange, data mining, and information retrieval more feasible.
6. Data Set: Our corpus consists of 594,000 de-identified radiology reports containing 36 million words and 4.3 million sentences. The reports were dictated by the Indiana University Radiology faculty, a group of 40 radiologists, from 1993 to 1998.
7. Structuring Free Text: Regular expressions were used to detect sentence boundaries in the reports. A regular expression is a concise and flexible way of matching strings of text, such as particular characters or words. Sentences were then annotated to propositions, which group sentences that express the same concept for similar findings across reports.
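The slides do not give the actual pattern used for sentence detection, so the following is only a minimal sketch of the idea. The pattern and the `split_sentences` helper are assumptions made for illustration: a boundary is taken to be a period, question mark, or exclamation mark followed by whitespace and then an uppercase letter or digit.

```python
import re

# Assumed boundary rule (not the project's real pattern): sentence-ending
# punctuation, then whitespace, then an uppercase letter or a digit.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.?!])\s+(?=[A-Z0-9])")


def split_sentences(report_text):
    """Split a report into sentences and drop empty fragments."""
    parts = SENTENCE_BOUNDARY.split(report_text.strip())
    return [part.strip() for part in parts if part.strip()]


report = "The heart size is normal. No pleural effusion is seen. Lungs are clear."
sentences = split_sentences(report)
for sentence in sentences:
    print(sentence)
```

Because the rule requires whitespace after the punctuation, a measurement such as "2.5 cm" stays intact, but an abbreviation such as "Dr. Smith" would be split incorrectly. A production pattern would need extra handling for such cases.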
8. Structuring Free Text (Cont.): A proposition is a declarative sentence that is either true or false, but not both. “Today is a beautiful sunny day.” is a proposition; “x + 2 = 4” is not, because its truth depends on the unknown x. Users can select propositions and map sentences to them.
10. Corpus Annotation: To annotate each new sentence from the radiology reports, the software first proposes candidate propositions. Experts review the suggested propositions and correct them as needed before validation. If no suitable proposition exists in the ontology, the expert can create a new one.
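The slides do not say how the software chooses which propositions to propose. As one possible illustration, the sketch below suggests propositions by string similarity to sentences that were already annotated, using Python's standard `difflib` module. The `ANNOTATED` table, the `propose_propositions` helper, and the 0.6 similarity cutoff are all hypothetical.

```python
import difflib

# Hypothetical store of already-annotated sentences (lowercased) and the
# proposition each was mapped to. Real data would come from the database.
ANNOTATED = {
    "the heart size is normal.": "Heart size is normal",
    "heart is normal in size.": "Heart size is normal",
    "no pleural effusion is seen.": "No pleural effusion",
}


def propose_propositions(sentence, limit=3, cutoff=0.6):
    """Return candidate propositions for a sentence, best match first."""
    key = sentence.strip().lower()
    if key in ANNOTATED:
        # Exact repeat of a known sentence: reuse its proposition.
        return [ANNOTATED[key]]
    similar = difflib.get_close_matches(key, list(ANNOTATED), n=limit, cutoff=cutoff)
    candidates = []
    for known_sentence in similar:
        proposition = ANNOTATED[known_sentence]
        if proposition not in candidates:  # avoid suggesting duplicates
            candidates.append(proposition)
    return candidates


suggestions = propose_propositions("The heart size is normal")
unknown = propose_propositions("zzzz")
print(suggestions, unknown)
```

An empty result corresponds to the case where nothing suitable exists yet and the expert creates a new proposition. In every other case the suggestions are only a starting point for expert review.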
12. Results: The ontology of propositions is built in parallel with the experts' annotation of sentences to existing propositions. So far, 427,433 unique sentences from the corpus have been annotated, representing 2,561,330 sentences in total (counting repeats), or 60% of all sentences in the corpus.
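The 60% figure can be checked from the numbers on the slides. The short calculation below uses the rounded corpus size of 4.3 million sentences given on the Data Set slide, so the result is approximate. The mean number of occurrences per unique sentence is a derived value, not one reported in the slides.

```python
unique_annotated = 427_433     # unique sentences annotated so far
covered_sentences = 2_561_330  # total sentences those unique sentences represent
total_sentences = 4_300_000    # "4.3 million" as rounded on the Data Set slide

coverage = covered_sentences / total_sentences
mean_occurrences = covered_sentences / unique_annotated

print(f"coverage: {coverage:.1%}")
print(f"mean occurrences per unique sentence: {mean_occurrences:.2f}")
```

Coverage comes out just under 60%, and each unique annotated sentence stands for about six sentences in the corpus on average. This is why annotating unique sentences covers the corpus much faster than annotating every occurrence.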
13. Results (Cont.): The propositions are grouped into main finding categories such as brain and skull, general radiology, and others. Each proposition is stored in a relational database together with whether it is a normal or abnormal finding and the number of sentences mapped to it. We can find the most frequent (highest-ranked) propositions by sorting on the number of mapped sentences, and we can count the normal and abnormal propositions and sentences overall and within each category.
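The database schema is not shown in the slides. The sketch below assumes a single hypothetical `proposition` table and uses made-up rows and counts, only to show how the ranking and the per-category normal/abnormal summaries described above could be expressed as SQL queries. It uses Python's built-in `sqlite3` module with an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per proposition.
conn.execute(
    """CREATE TABLE proposition (
           id INTEGER PRIMARY KEY,
           text TEXT NOT NULL,
           category TEXT NOT NULL,
           is_normal INTEGER NOT NULL,      -- 1 = normal finding, 0 = abnormal
           sentence_count INTEGER NOT NULL  -- sentences mapped to it
       )"""
)
# Made-up example rows; the counts are illustrative, not results.
rows = [
    ("Heart size is normal", "general radiology", 1, 1200),
    ("No pleural effusion", "general radiology", 1, 950),
    ("Pleural effusion present", "general radiology", 0, 310),
    ("No intracranial hemorrhage", "brain and skull", 1, 400),
    ("Skull fracture present", "brain and skull", 0, 45),
]
conn.executemany(
    "INSERT INTO proposition (text, category, is_normal, sentence_count) "
    "VALUES (?, ?, ?, ?)",
    rows,
)

# Most frequent propositions, ranked by number of mapped sentences.
top = conn.execute(
    "SELECT text, sentence_count FROM proposition "
    "ORDER BY sentence_count DESC LIMIT 3"
).fetchall()

# Normal vs. abnormal propositions and sentences within each category.
by_category = conn.execute(
    """SELECT category,
              SUM(is_normal) AS normal_propositions,
              SUM(1 - is_normal) AS abnormal_propositions,
              SUM(CASE WHEN is_normal = 1 THEN sentence_count ELSE 0 END),
              SUM(CASE WHEN is_normal = 0 THEN sentence_count ELSE 0 END)
       FROM proposition
       GROUP BY category
       ORDER BY category"""
).fetchall()

print(top)
print(by_category)
```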
19. Similar Works: CLEF (Clinical E-Science Framework). The CLEF corpus consists of both structured records and free-text documents (clinical narratives, radiology reports, and histopathology reports). Its clinical text is semantically annotated to assist in the development and evaluation of an information extraction system.
21. LEXIMER (Cont.): Phrase isolation scans the report text and separates the content into phrases. Noise reduction decreases the amount of non-clinically relevant information contained within the report. Signal extraction pulls out the positive statements and recommendations from the clinically relevant phrases.
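To make the three stage names concrete, here is a toy pipeline with one function per stage. It is not Leximer's actual algorithm, which the slides do not describe. The cue lists and all function names are invented for illustration, and real systems would need far richer rules.

```python
import re

# Hypothetical cue lists, invented for this illustration only.
NOISE_CUES = ("dictated by", "thank you")
NEGATION_PREFIXES = ("no ", "without ", "negative for ")
RECOMMENDATION_CUES = ("recommend", "suggest", "advised")


def isolate_phrases(report_text):
    """Phrase isolation: split the report on simple punctuation."""
    parts = re.split(r"[.;,]\s*", report_text)
    return [part.strip().lower() for part in parts if part.strip()]


def reduce_noise(phrases):
    """Noise reduction: drop phrases that carry no clinical content."""
    return [p for p in phrases if not any(cue in p for cue in NOISE_CUES)]


def extract_signal(phrases):
    """Signal extraction: keep positive statements and recommendations."""
    findings, recommendations = [], []
    for phrase in phrases:
        if any(cue in phrase for cue in RECOMMENDATION_CUES):
            recommendations.append(phrase)
        elif not phrase.startswith(NEGATION_PREFIXES):
            findings.append(phrase)
    return findings, recommendations


report = (
    "Dictated by the attending radiologist. No pleural effusion. "
    "Nodule in the right upper lobe; follow-up CT is recommended."
)
findings, recommendations = extract_signal(reduce_noise(isolate_phrases(report)))
print(findings)
print(recommendations)
```

In this toy run the dictation line is treated as noise and the negated effusion phrase is dropped. The nodule phrase survives as a positive finding, and the follow-up phrase is kept as a recommendation.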
22. NLP using OLAP for assessing recommendations in radiology reports: The database held 4,279,179 radiology reports from a single tertiary health care center over a 10-year period (1995-2004), covering the most common imaging modalities together with patient demographics. Leximer, in conjunction with online analytic processing (OLAP), was used to classify reports into those with recommendations for imaging (IREC) and those without. IREC rates were determined for different patient age groups, genders, imaging modalities, indications, diseases, subspecialties, and referring physicians.
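An OLAP-style analysis of this kind amounts to grouping classified reports by one dimension at a time and computing the share that contain a recommendation. The sketch below shows that aggregation in plain Python. The records, field names, and the `irec_rate` helper are hypothetical, and the resulting rates are not figures from the cited study.

```python
from collections import defaultdict

# Made-up classified reports; "has_recommendation" stands in for the
# output of the report classifier.
reports = [
    {"modality": "CT", "age_group": "40-59", "has_recommendation": True},
    {"modality": "CT", "age_group": "60+", "has_recommendation": False},
    {"modality": "XR", "age_group": "40-59", "has_recommendation": False},
    {"modality": "XR", "age_group": "40-59", "has_recommendation": False},
    {"modality": "XR", "age_group": "60+", "has_recommendation": True},
    {"modality": "XR", "age_group": "60+", "has_recommendation": False},
]


def irec_rate(records, dimension):
    """Share of reports with a recommendation, per value of one dimension."""
    totals = defaultdict(int)
    with_recommendation = defaultdict(int)
    for record in records:
        value = record[dimension]
        totals[value] += 1
        with_recommendation[value] += int(record["has_recommendation"])
    return {value: with_recommendation[value] / totals[value] for value in totals}


rate_by_modality = irec_rate(reports, "modality")
rate_by_age_group = irec_rate(reports, "age_group")
print(rate_by_modality)
print(rate_by_age_group)
```

The same function works for any dimension stored with the report, such as indication or referring physician. Slicing one measure along many dimensions is the core idea behind the OLAP approach.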
23. Discussion: The CLEF work covers a very limited number of reports. In Leximer, the classification method is not validated, and phrases cannot convey the full meaning of a sentence. What distinguishes our work from others is the large amount of data mined and the consistent expert validation.
24. References: Friedlin, J., Mahoui, M., Jones, J., Kashyap, V., & Jamieson, P. (2010). Knowledge Discovery and Data Mining of Free Text Radiology. Submitted to the Journal of Biomedical Informatics. Roberts, A., Gaizauskas, R., Hepple, M., Demetriou, G., Guo, Y., Setzer, A., et al. (2008). Semantic Annotation of Clinical Text: The CLEF Corpus. Retrieved April 20, 2010, from ftp://ftp.dcs.shef.ac.uk/home/robertg/papers/lrec08-clefcorpus.pdf Dang PA, Kalra MK, Blake MA, Schultz TJ, Stout M, Lemay PR, Freshman DJ, Halpern EF, Dreyer KJ. Natural language processing using online analytic processing for assessing recommendations in radiology reports. J Am Coll Radiol. 2008 Mar;5(3):197-204. http://www.nuance.com/healthcare/products/radcube-for-radiology.asp