Back in January, Google Health, the branch of Google focused on health-related research, clinical tools, and partnerships for health care services, released an AI model trained on over 90,000 mammogram X-rays that the company said achieved better results than human radiologists. Google claimed that the algorithm could recognize more false negatives — the kind of images that look normal but contain breast cancer — than previous work, but some clinicians, data scientists, and engineers take issue with that statement. In a rebuttal published today in the journal Nature, over 19 coauthors affiliated with McGill University, the City University of New York (CUNY), Harvard University, and Stanford University said that the lack of detailed methods and code in Google’s research “undermines its scientific value.”
Science in general has a reproducibility problem — a 2016 poll of 1,500 scientists reported that 70% of them had tried but failed to reproduce at least one other scientist’s experiment — but it’s particularly acute in the AI field. At ICML 2019, 30% of authors failed to submit their code with their papers by the start of the conference. Studies often provide benchmark results in lieu of source code, which becomes problematic when the thoroughness of the benchmarks comes into question. One recent report found that 60% to 70% of answers given by natural language processing models were embedded somewhere in the benchmark training sets, indicating that the models were often simply memorizing answers. Another study — a meta-analysis of over 3,000 AI papers — found that metrics used to benchmark AI and machine learning models tended to be inconsistent, irregularly tracked, and not particularly informative.
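To make the memorization concern concrete, here is a minimal sketch of how one might estimate the share of benchmark answers that already appear verbatim in a training set. It is not the method used in the report cited above; the function names `normalize` and `overlap_rate` and the sample strings are hypothetical, and a real audit would need fuzzier matching than simple substring search.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences don't hide a match."""
    return " ".join(text.lower().split())


def overlap_rate(train_texts, test_answers):
    """Return the fraction of test answers found verbatim inside any training text."""
    if not test_answers:
        return 0.0
    corpus = [normalize(t) for t in train_texts]
    hits = sum(any(normalize(a) in doc for doc in corpus) for a in test_answers)
    return hits / len(test_answers)


# Hypothetical toy data, for illustration only.
train_texts = [
    "The capital of France is Paris.",
    "Water boils at 100 degrees Celsius.",
]
test_answers = ["Paris", "100 degrees celsius", "Canberra"]

rate = overlap_rate(train_texts, test_answers)
print(f"{rate:.0%} of test answers appear in the training text")
```

A high rate suggests that a strong benchmark score may reflect recall of training data rather than generalization.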
In their rebuttal, the coauthors of the Nature commentary point out that Google’s breast cancer model research lacks details, including a description of model development as well as the data processing and training pipelines used. Google omitted the definition of several hyperparameters for the model’s architecture (settings fixed before training that govern the model’s structure and how it learns, as opposed to the weights learned from data), and it also didn’t disclose the parameters of the transformations used to augment the dataset on which the model was trained. This could “significantly” affect performance, the Nature coauthors claim; for instance, it’s possible that one of the data augmentations Google used resulted in multiple instances of the same patient, biasing the final results.
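The worry about the same patient appearing more than once can be illustrated with a short sketch. One common safeguard, assumed here and not taken from Google’s paper or the Nature commentary, is to split data by patient before augmenting, so that no patient’s images land in both the training and evaluation sets. The names `split_by_patient` and `augment` and the toy records are hypothetical.

```python
import random
from collections import defaultdict


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split (patient_id, image_id) records so that no patient spans both sets."""
    by_patient = defaultdict(list)
    for patient_id, image_id in records:
        by_patient[patient_id].append(image_id)

    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    train = [(p, i) for p in patients if p not in test_patients for i in by_patient[p]]
    test = [(p, i) for p in patients if p in test_patients for i in by_patient[p]]
    return train, test


def augment(records, copies=2):
    """Stand-in for image augmentation: emit `copies` labeled variants per image."""
    return [(p, f"{i}_aug{k}") for p, i in records for k in range(copies)]


# Hypothetical toy dataset: 10 patients with 2 scans each.
records = [(f"patient{n}", f"scan{n}_{v}") for n in range(10) for v in range(2)]

train, test = split_by_patient(records)
train = augment(train)  # augment after splitting, and only the training set

shared = {p for p, _ in train} & {p for p, _ in test}
print(len(shared))  # 0: no patient appears on both sides of the split
```

If augmentation ran before an image-level split instead, variants of one patient’s scan could end up in both sets and inflate the measured accuracy.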
“On paper and in theory, the [Google] study is beautiful,” Dr. Benjamin Haibe-Kains, senior scientist at Princess Margaret Cancer Centre and first author of the Nature commentary, said. “But if we can’t learn from it then it has little to no scientific value … Researchers are more incentivized to publish their finding rather than spend time and resources ensuring their study can be replicated … Scientific progress depends on the ability of researchers to scrutinize the results of a study and reproduce the main finding to learn from.”
For its part, Google said that the code used to train the model had a number of dependencies on internal tooling, infrastructure, and hardware, making its release infeasible. The company also cited the two training datasets’ proprietary nature (both were under license) and the sensitivity of patient health data in its decision not to release them. But the Nature coauthors note that the sharing of raw data has become more common in biomedical literature, increasing from under 1% in the early 2000s to 20% today, and that the model predictions and data labels could have been released without compromising personal information.
“[Google’s] multiple software dependencies of large-scale machine learning applications require appropriate control of software environment, which can be achieved through package managers including Conda, as well as container and virtualization systems, including Code Ocean, Gigantum, and Colaboratory,” the coauthors wrote in Nature. “If virtualization of the internal tooling proved to be difficult, [Google] could have released the computer code and documentation. The authors could also have created toy examples to show how new data must be processed to generate predictions.”
The Nature coauthors assert that for efforts where human lives are at stake, as would be the case for Google’s model were it to be deployed in a clinical setting, there should be a high bar for transparency. If data can’t be shared with the community because of licensing or other insurmountable issues, they wrote, a mechanism should be established so that trained, independent investigators can access the data and verify the analyses, allowing peer review of the study and its evidence.
“We have high hopes for the utility of AI methods in medicine,” they wrote. “Ensuring that these methods meet their potential, however, requires that these studies be reproducible.”
Indeed, partly due to a reluctance to release code, datasets, and techniques, much of the data used today to train AI algorithms for diagnosing diseases may perpetuate inequalities. A team of U.K. scientists found that almost all eye disease datasets come from patients in North America, Europe, and China, meaning eye disease-diagnosing algorithms are less certain to work well for racial groups from underrepresented countries. In another study, Stanford University researchers claimed that most of the U.S. data for studies involving medical uses of AI come from California, New York, and Massachusetts. A study of a UnitedHealth Group algorithm determined that it could underestimate the number of Black patients in need of greater care by half. And a growing body of work suggests that skin cancer-detecting algorithms tend to be less precise when used on Black patients, in part because AI models are trained mostly on images of light-skinned patients.
Beyond basic dataset challenges, models lacking sufficient peer review can encounter unforeseen roadblocks when deployed in the real world. Scientists at Harvard found that algorithms trained to recognize and classify CT scans could become biased toward scan formats from certain CT machine manufacturers. Meanwhile, a Google-published whitepaper revealed challenges in implementing an eye disease-predicting system in hospitals in Thailand, including issues with scan accuracy. And studies conducted by companies like Babylon Health, a well-funded telemedicine startup that claims to be able to triage a range of diseases from text messages, have been repeatedly called into question.
“If not properly addressed, propagating these biases under the mantle of AI has the potential to exaggerate the health disparities faced by minority populations already bearing the highest disease burden,” wrote the coauthors of a recent paper in the Journal of the American Medical Informatics Association, which argued that biased models may further the disproportionate impact the coronavirus pandemic is having on people of color. “These tools are built from biased data reflecting biased healthcare systems and are thus themselves also at high risk of bias, even if explicitly excluding sensitive attributes such as race or gender.”
The Nature coauthors argue that medical models must be validated by third parties. Failing to do so, they said, could reduce a model’s impact and lead to unintended consequences. “Unfortunately, the biomedical literature is littered with studies that have failed the test of reproducibility, and many of these can be tied to methodologies and experimental practices that could not be investigated due to failure to fully disclose software and data,” they wrote. “The failure of [Google] to share key materials and information transforms their work from a scientific publication open to verification into a promotion of a closed technology.”