The success of AI will depend on validated public data sets

by John W. Mitchell, Senior Correspondent | December 09, 2019

A panel presentation last Sunday at RSNA offered insights on how to access and assemble publicly available data sets for use in generating AI algorithms. The group also touched on points of caution around bias, validation, reference diagnosis, sample size, and other variables that can skew data to the detriment of good AI development.

John Freymann, informatics manager at the National Cancer Institute, discussed The Cancer Imaging Archive (TCIA), which hosts and de-identifies more than 100 data set sources for use by researchers. He reviewed new data sets, including data generated by NCI/NIH grants, challenge competitions, and publication data sharing requests.

“More and more data sets are being built for AI performance,” Freymann told the audience.

TCIA has more than 15,000 active users per month, and nearly 900 peer-reviewed articles have been published based on TCIA data. The data sets are widely used, said Freymann, because of their permissive Creative Commons licensing agreements. He also praised the increasing number of challenge competitions with curation from radiologists as valuable sources of new data for AI applications.

The second speaker, Dr. Laura Coombs, vice president of data science and informatics at the American College of Radiology, touched on processes and changes in several data quality areas. These included anonymization of data (removing individual patient information), ground truth (the reliability of data collected on-site), and federation (bringing the algorithm to the data securely rather than exporting data for development and exposing it to security risks).

Dr. Jayashree Kalpathy-Cramer, director of QTIM Lab at the Center for Machine Learning, MGH & BWH Center of Clinical Data Science, made a case for the expansion of public data sets by citing several points:

– There is, arguably, a reproducibility crisis in research.
– Very few publications have been validated using external data sets.
– Multi-institutional data sets are needed to build (and validate) robust AI tools.
– Radiomics and learning methods can be "brittle," in that performance degrades when they are applied to data sets other than those the model was trained on in-house.
– Models built on limited internal databases can encode and propagate historic biases.

She added that building robust machine learning models requires large volumes of well-annotated data sets, and public data sets can improve reproducibility. However, the annotations need to be built into the models using common standards, as human annotations can be biased.


