Largest multi-lesion CT imaging dataset, DeepLesion, available to public

July 23, 2018
by Thomas Dworetzky, Contributing Reporter
The new publicly accessible medical imaging database, DeepLesion, is a “critical step forward in computer-aided radiology detection, diagnosis, and deep learning,” according to the paper announcing its availability in the Journal of Medical Imaging.

It is the largest CT lesion-image database ever made available to the public, with over 32,000 annotated lesions from over 10,000 cases, according to the team from the National Institutes of Health Clinical Center that developed it. Such huge, annotated radiological datasets are essential in the creation of deep learning approaches to medical data.

"We hope the data set will benefit the medical imaging area just as ImageNet benefited the computer vision area," said Ke Yan, the lead author on the paper and a postdoctoral fellow with senior author Dr. Ronald Summers, senior investigator and staff radiologist at the center.

DeepLesion was created by “mining” historical medical data from the Institute's own Picture Archiving and Communication System (PACS).

“This new dataset has tremendous potential to jump-start the field of computer-aided detection (CADe) and diagnosis (CADx),” according to the release.

DeepLesion differs from most other medical image datasets now available, which are limited to a single type of lesion, according to an NIH statement.

When examining CT images, radiologists at the Clinical Center measure and mark clinically significant findings using “electronic bookmarks”, which can be complex and include arrows, lines, diameters, and text that pinpoint a tumor's location and size so that experts can spot growth or new disease.

“The bookmarks, abundant with retrospective medical data, are what scientists used to develop the DeepLesion dataset,” stated the NIH, noting that unlike most other datasets, DeepLesion has great diversity, with “all kinds of critical radiology findings from across the body, such as lung nodules, liver tumors, enlarged lymph nodes, and so on.”

The lack of such a multi-category lesion data set “has been a major roadblock to development of more universal CADe frameworks capable of detecting multiple lesion types.”

This new multi-category dataset could “even enable development of CADx systems that automate radiological diagnosis,” according to the statement.

The team also created a universal lesion detector from their work on DeepLesion, and noted that while detection is time-consuming for radiologists, it is crucial to diagnosis. The thought is that this detector could be used in the future for screening by either radiologists or other CADe systems.

Future plans call for improving the accuracy of detection through the use of 3D and lesion type information, expanding the data set to include data from other institutions, and enlarging the data set with MR images.

The NIH Clinical Center made news in 2017 when it released more than 100,000 anonymized chest x-rays and corresponding data “to allow researchers across the country and around the world to freely access the datasets and increase their ability to teach computers how to detect and diagnose disease,” it said at that time.

That data set, known as “ChestX-ray8”, came from scans of more than 30,000 patients and was also released by Summers and an NIH team.

“Building truly large-scale, fully automated, high-precision medical diagnosis systems remains a strenuous task,” stated lead author Xiaosong Wang and senior author Summers in the paper presenting the data set. They noted that “ChestX-ray8” can “enable the data-hungry deep neural network paradigms to create clinically meaningful applications, including common disease pattern mining, disease correlation analysis and automated radiological report generation.”

The DeepLesion database can be downloaded at https://nihcc.box.com/v/DeepLesion.