An MIT Computer Science and Artificial Intelligence Lab (CSAIL) team claims to have developed an AI system that can analyze X-rays to anticipate certain kinds of heart failure. By detecting signs of excess fluid in the lungs, a condition known as pulmonary edema, the system can correctly grade heart failure severity on a four-level scale more than half the time, the researchers say.
Every year, roughly 12.5% of deaths in the U.S. are caused at least in part by heart failure, according to the U.S. Centers for Disease Control and Prevention. One of acute heart failure’s most common warning signs is edema; a patient’s exact level of excess fluid often dictates a doctor’s course of action. But making these determinations is difficult and requires clinicians to rely on subtle features in X-rays, which can sometimes lead to inconsistent diagnoses and treatment plans.
To overcome this challenge, the CSAIL team developed an AI model that learns jointly from a large number of chest radiographs and their associated radiology reports, using only a limited number of edema severity labels along with text from the reports’ “impressions,” “findings,” “conclusion,” “recommendation,” and “final report” sections. (Severity labels range from 0 to 3, with 3 indicating the most severe condition.) At inference time, the model computes edema severity from an input image, and it can also make predictions from the reports themselves.
The system had to be designed to handle varying tones and a range of terminology, accounting for radiologists’ unique writing styles. In a step toward this, the researchers developed a set of linguistic rules and substitutions, ensuring that data could be analyzed consistently across reports even when the reports lack labels for the edema severity.
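The article does not publish the team’s actual rules, but the general idea of normalizing report text and then matching severity keywords can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the substitution list, the keyword patterns, the keyword-to-level mapping, and the function names `normalize` and `extract_severity` are hypothetical and are not taken from the CSAIL system.

```python
import re

# Hypothetical substitution rules (NOT the researchers' rules): map variant
# spellings and shorthand to a single canonical phrasing.
SUBSTITUTIONS = [
    (r"\bpulmonary oedema\b", "pulmonary edema"),
    (r"\bchf\b", "congestive heart failure"),
    (r"\bw/o\b", "without"),
]

# Hypothetical keyword patterns paired with an assumed severity level (0 to 3).
SEVERITY_KEYWORDS = [
    (r"\bno (evidence of )?pulmonary edema\b", 0),
    (r"\bvascular congestion\b", 1),
    (r"\binterstitial (pulmonary )?edema\b", 2),
    (r"\balveolar (pulmonary )?edema\b", 3),
]


def normalize(text):
    """Lowercase, collapse whitespace, and apply the substitution rules."""
    text = re.sub(r"\s+", " ", text.lower()).strip()
    for pattern, replacement in SUBSTITUTIONS:
        text = re.sub(pattern, replacement, text)
    return text


def extract_severity(text):
    """Return the highest matched severity level, or None if nothing matches.

    Negation is only handled by the level-0 pattern; a phrase such as
    "no interstitial edema" would be mislabeled by this simplified sketch.
    """
    normalized = normalize(text)
    levels = [
        level
        for pattern, level in SEVERITY_KEYWORDS
        if re.search(pattern, normalized)
    ]
    return max(levels) if levels else None
```

Reports that match no pattern return `None`, which mirrors the situation described above in which many reports carry no severity label at all.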
In training the model, the researchers sourced data from the open-source MIMIC-CXR dataset, which contains 377,110 chest radiographs associated with 227,835 radiology reports. After extracting severity labels from the associated files, filtering for keywords from other disease processes, and limiting label extraction to patients with congestive heart failure, the researchers were left with a training dataset of 247,425 image-text pairs.
To evaluate the model, the researchers randomly selected hundreds of image-text pairs and had a board-certified radiologist and domain experts review and correct the labels of the reports. After training on both images and text, the system achieved 90% accuracy when classifying level-3 pulmonary edemas and 82% and 81% accuracy, respectively, when classifying level-1 and level-2 edemas.
In collaboration with Beth Israel Deaconess Medical Center (BIDMC) and Philips, the team plans to integrate their system into BIDMC’s emergency room workflow this fall. They hope the annotations of the severity labels, which were agreed upon by a team of four radiologists, can serve as a universal standard to benchmark future machine learning development.
“Our model can turn both images and text into compact numerical abstractions from which an interpretation can be derived,” Ph.D. student and coauthor Geeticka Chauhan said. “We trained it to minimize the difference between the representations of the X-ray images and the text of the radiology reports, using the reports to improve the image interpretation … These correlations will be valuable for improving search through a large database of X-ray images and reports, to make retrospective analysis even more effective.”
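The objective Chauhan describes, pulling an image’s representation toward that of its paired report, can be sketched in a few lines of Python. This is an illustrative simplification under stated assumptions rather than the team’s implementation: the squared Euclidean distance, the cross-entropy terms, the `weight` parameter, and the function names are all hypothetical choices, and the embeddings and logits are assumed to come from image and text encoders that are not shown.

```python
import math


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, using a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


def joint_loss(image_emb, text_emb, image_logits, text_logits,
               label=None, weight=1.0):
    """Assumed joint objective for one image-report pair.

    The matching term is always applied, so unlabeled pairs still contribute.
    The classification terms are added only when a severity label (0 to 3)
    is available for the pair.
    """
    loss = weight * squared_distance(image_emb, text_emb)
    if label is not None:
        loss += cross_entropy(image_logits, label)
        loss += cross_entropy(text_logits, label)
    return loss
```

Making the label optional reflects the setup described earlier, in which most image-text pairs have no severity label and only a limited subset does.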