Near Duplicate Detection for Medical Imaging Data Warehouse Construction
Introduction
Distributed Near Duplicate Detection
●
Integrate medical data from various heterogeneous medical data sources and private
archives using the public APIs.
●
Curate the integrated data into a data warehouse for public access.
●
Store the detected duplicate pairs into a separate data source.
●
Duplicate detection by analyzing the potential data pairs from the original data sources,
using similarity matrices for textual data.
●
Hierarchical metadata attached to the binary medical data is used to identify, classify, and
find duplicates among the binary raw data.
●
Accounts for inconsistencies in representation, such as
– acronyms used instead of the full form of the attributes.
– different measurement units.
●
Data is published to various data sources by the medical data publishers
– through the respective write APIs of the data sources.
●
Connects to the original data sources through their read APIs.
●
Output of consolidated data and duplicate pairs
– stored through the relevant write APIs.
●
Medical data consumers consume the data from the warehouse composed by MediCurator
through its read API.
●
The data warehouse is considered to be free from duplicates
– subject to false positives and false negatives,
– which depend on the effectiveness of the similarity matrices and similarity join algorithms used.
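The representation inconsistencies and the textual similarity comparison described above can be illustrated with a minimal Python sketch. It is not MediCurator's implementation: the acronym table, the unit table, the function names, and the 0.8 threshold are all hypothetical, and the pairs are found by a naive all-pairs Jaccard comparison rather than an optimized similarity join.

```python
import re
from itertools import combinations

# Hypothetical lookup tables; real data would need curated vocabularies.
ACRONYMS = {"mri": "magnetic resonance imaging", "ct": "computed tomography"}
# Factors that convert a length unit to millimetres.
UNIT_TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0}


def normalize(text):
    """Lower-case, rewrite lengths in millimetres, and expand known acronyms."""
    def to_mm(match):
        value, unit = float(match.group(1)), match.group(2)
        return f"{value * UNIT_TO_MM[unit]:g} mm"

    text = re.sub(r"(\d+(?:\.\d+)?)\s*(mm|cm|m)\b", to_mm, text.lower())
    tokens = re.findall(r"[a-z]+|\d+(?:\.\d+)?", text)
    return " ".join(ACRONYMS.get(tok, tok) for tok in tokens)


def jaccard(a, b):
    """Jaccard similarity of the token sets of two normalized strings."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0


def find_duplicate_pairs(records, threshold=0.8):
    """Compare every pair of records and return the ids of near duplicates."""
    normalized = {rid: normalize(text) for rid, text in records.items()}
    return [(a, b) for a, b in combinations(sorted(normalized), 2)
            if jaccard(normalized[a], normalized[b]) >= threshold]


records = {
    "r1": "Chest CT, slice thickness 0.5 cm",
    "r2": "chest computed tomography slice thickness 5 mm",
    "r3": "Brain MRI 1 mm",
}
duplicate_pairs = find_duplicate_pairs(records)
```

Here `r1` and `r2` differ only in acronym use and measurement unit, so they normalize to the same string and are reported as a duplicate pair, while `r3` is kept apart.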
●
Hazelcast is used for distributed near duplicate detection.
●
Metadata is attached to the binary images in medical image archives
– e.g., The Cancer Imaging Archive (TCIA).
Pradeeban Kathiravelu Ashish Sharma
●
Medical data warehouses and image archives are constructed by integrating multiple private
and public data sources.
●
Finding almost identical entries is crucial for warehouse construction.
●
Medical image archives are huge and consist of structured and hierarchical data, which may
be accessed by querying the metadata.
●
Existing solutions tend to be too specific.
– e.g., the Master Patient Index (MPI) covers only patient records.
●
Multiple dimensions and attributes
– including medications, clinical, and pathological data
– should be considered for a complete duplicate detection and elimination.
●
MediCurator is a near duplicate detection framework for heterogeneous medical data
sources in constructing data warehouses.
●
MediCurator has been developed to retrieve medical data from
– various data sources, including: MySQL, MongoDB, CSV files, and
– medical image archives such as TCIA
●
MediCurator fits as part of the ETL process.
– Duplicates are detected in-memory.
– Merged data is stored in data warehouses hosted in the Hadoop Distributed File System
(HDFS).
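The ETL role described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `patient` and `modality` fields, the function names, and the 0.9 threshold are hypothetical, a list of dictionaries stands in for a document store such as MongoDB, plain Python lists stand in for the warehouse and the duplicate store, and only one attribute is compared although a complete detection would consider several.

```python
import csv
import io
from difflib import SequenceMatcher


def extract_csv(text, source):
    """Read rows from CSV text and tag each row with its source name."""
    return [dict(row, source=source) for row in csv.DictReader(io.StringIO(text))]


def extract_documents(docs, source):
    """Tag already-parsed documents (e.g. from a document store) with their source."""
    return [dict(doc, source=source) for doc in docs]


def similarity(a, b):
    """Character-level similarity of two patient names after light normalization."""
    def norm(record):
        return " ".join(record["patient"].lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def curate(records, threshold=0.9):
    """In memory: keep the first record of each near-duplicate group,
    and collect every later match as a (kept, duplicate) pair."""
    merged, duplicate_pairs = [], []
    for record in records:
        match = next((kept for kept in merged
                      if similarity(kept, record) >= threshold), None)
        if match is None:
            merged.append(record)
        else:
            duplicate_pairs.append((match, record))
    return merged, duplicate_pairs


csv_text = "patient,modality\nJohn Smith,CT\nJane Doe,MRI\n"
docs = [{"patient": "john  smith", "modality": "CT"}]

records = extract_csv(csv_text, "csv") + extract_documents(docs, "mongo")
merged, duplicate_pairs = curate(records)
```

In a real pipeline, `merged` would be written to the warehouse and `duplicate_pairs` to the separate duplicate store through their respective write APIs.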
MediCurator Approach
Design
Implementation
●
A prototype has been implemented.
– Hazelcast serves as the distributed execution framework.
– Near duplicate detection algorithms from research are executed in a distributed manner on the metadata.
– A tenfold speed-up is observed compared to existing solutions such as MPI systems.
●
MediCurator functions as an integration middleware
– for data warehouse construction
– with duplicate detection and elimination
– from the raw textual medical data, or the binary data by leveraging the metadata
attached to it.
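The idea of distributing the pairwise metadata comparisons can be sketched in Python as below. This is a single-machine stand-in and does not use Hazelcast or its API: a `ThreadPoolExecutor` plays the role of the distributed executor, and the function names, the sample metadata strings, the worker count, and the threshold are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def token_jaccard(a, b):
    """Jaccard similarity of the lower-cased token sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0


def compare_chunk(chunk, metadata, threshold):
    """Return the pairs in one chunk whose metadata strings are similar enough."""
    return [(a, b) for a, b in chunk
            if token_jaccard(metadata[a], metadata[b]) >= threshold]


def distributed_duplicates(metadata, workers=4, threshold=0.8):
    """Split all candidate pairs into chunks and compare each chunk on its own worker."""
    pairs = list(combinations(sorted(metadata), 2))
    chunks = [pairs[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda c: compare_chunk(c, metadata, threshold), chunks)
        return sorted(pair for found in results for pair in found)


metadata = {
    "img1": "TCGA-01 CT chest",
    "img2": "tcga-01 ct chest",
    "img3": "TCGA-02 MR brain",
    "img4": "TCGA-03 CT abdomen",
}
duplicates = distributed_duplicates(metadata)
```

Because each chunk of candidate pairs is independent, the same partitioning could be handed to workers on separate nodes; the threads here only illustrate the split-and-merge structure.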
{pkathi2, ashish.sharma} @ emory.edu
Department of Biomedical Informatics, Emory University School of Medicine, Atlanta, GA.
Acknowledgments
* Google Summer of Code 2015
* NCI U01 [1U01CA187013-01], Resources for development and validation of
Radiomic Analyses & Adaptive Therapy, Fred Prior, Ashish Sharma (UAMS, Emory)