by John W. Mitchell, Senior Correspondent | December 09, 2019
A panel presentation last Sunday at RSNA offered insights on how to access and assemble publicly available data sets for use in developing AI algorithms. The group also touched on points of caution around bias, validation, reference diagnosis, sample size, and other variables that can skew data to the detriment of good AI development.
John Freymann, informatics manager at the National Cancer Institute, discussed The Cancer Imaging Archive (TCIA), which hosts and de-identifies more than 100 data set sources for use by researchers. He reviewed new data sets, including data generated by NCI/NIH grants, challenge competitions, and publication data sharing requests.
“More and more data sets are being built for AI performance,” Freymann told the audience.
TCIA has more than 15,000 active users per month, and nearly 900 peer-reviewed articles have been published based on TCIA. The data sets are widely used, said Freymann, because of their permissive Creative Commons licensing agreements. He also praised the increasing number of challenge competitions with curation from radiologists as valuable sources of new data for AI applications.
The second speaker, Dr. Laura Coombs, vice president of data science and informatics at the American College of Radiology, touched on processes and changes in several data quality areas. These included anonymization of data (removing individual patient information), ground truth (the reliability of data collected on-site), and federation (bringing the algorithm to the data securely rather than exporting data for development and exposing it to security risks).
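To make the federation idea concrete, here is a minimal Python sketch, not anything presented by the panel. It assumes two hypothetical sites that each run a small function locally and share only aggregate numbers, so the raw measurements never leave the site. The function names and the values are invented for illustration.

```python
from typing import List, Tuple


def local_summary(values: List[float]) -> Tuple[float, int]:
    """Runs inside a site: returns only a sum and a count, never raw values."""
    return sum(values), len(values)


def federated_mean(summaries: List[Tuple[float, int]]) -> float:
    """Runs at the coordinator: combines per-site summaries into one mean."""
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    if count == 0:
        raise ValueError("no data reported by any site")
    return total / count


# Made-up measurements held at two hypothetical sites.
site_a = [1.0, 2.0, 3.0]
site_b = [5.0, 7.0]

# Each site computes its own summary; only these tuples are transmitted.
summaries = [local_summary(site_a), local_summary(site_b)]
pooled_mean = federated_mean(summaries)
print(pooled_mean)
```

Real federated training shares model updates rather than simple sums, but the division of labor is the same: computation happens where the data lives, and only derived quantities travel.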
Dr. Jayashree Kalpathy-Cramer, director of QTIM Lab at the Center for Machine Learning, MGH & BWH Center of Clinical Data Science, made a case for the expansion of public data sets by citing several points:
– There is, arguably, a reproducibility crisis in research.
– Very few publications have been validated using external data sets.
– Multi-institutional data sets are needed to build (validate) robust AI tools.
– Radiomics and learning methods can be "brittle," in that performance degrades when they are applied to data sets other than those the model was trained on in-house.
– Models built on limited internal databases can encode and propagate historic biases.
She added that building robust machine learning models requires large volumes of well-annotated data, and public data sets can improve reproducibility. However, the annotations need to be built into the models using common standards, as human annotations can be biased.
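The brittleness and external-validation points above can be illustrated with a toy Python sketch. It is an assumption-laden example, not a method from the panel: a one-feature threshold classifier is tuned on a made-up "internal" data set and then scored on a made-up "external" set whose feature values are shifted, as might happen with different equipment or protocols. All names and numbers are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, int]  # (feature value, label 0 or 1)


def accuracy(data: List[Sample], threshold: float) -> float:
    """Fraction of samples where 'feature >= threshold' matches the label."""
    correct = sum(1 for x, y in data if (x >= threshold) == bool(y))
    return correct / len(data)


def fit_threshold(data: List[Sample]) -> float:
    """Pick the observed feature value that maximizes accuracy on the data."""
    candidates = sorted({x for x, _ in data})
    return max(candidates, key=lambda t: accuracy(data, t))


# Made-up internal data: the two classes separate cleanly at 0.6.
internal = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
            (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]

# Made-up external data: same labels, but every feature value is shifted up.
external = [(0.4, 0), (0.5, 0), (0.6, 0), (0.7, 0),
            (0.9, 1), (1.0, 1), (1.1, 1), (1.2, 1)]

threshold = fit_threshold(internal)
internal_acc = accuracy(internal, threshold)
external_acc = accuracy(external, threshold)
print(threshold, internal_acc, external_acc)
```

On these invented numbers the classifier is perfect on the internal set and misclassifies a quarter of the external set, a drop that would go unnoticed without validation on data from outside the originating institution.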