Meeting the challenge of bias in healthcare artificial intelligence

2021-12-15

Source: NursingResearch护理研究前沿


AI-based models may amplify pre-existing human bias within datasets; addressing this problem will require a fundamental realignment of the culture of software development.

In artificial intelligence (AI)-based predictive models, bias—defined as unfair systematic error—is a growing source of concern, particularly in healthcare applications. Especially problematic is the unfairness that arises from unequal distribution of error among groups that are vulnerable to harm, historically subject to discrimination or socially marginalized.


Credit: Marina Spence

In this issue of Nature Medicine, Seyyed-Kalantari and colleagues [1] examine three large, publicly available radiology datasets to demonstrate a specific type of bias in AI-based chest X-ray prediction models. They found that these models are more likely to falsely predict that patients are healthy if they are members of underserved populations, even when using classifiers that are based on state-of-the-art computer vision techniques. In other words, they identified an underdiagnosis bias, which is especially ethically problematic because it would wrongly categorize already underserved patients as not in need of treatment—thereby exacerbating existing health disparities.

The authors found consistent underdiagnosis of female patients, patients under 20 years old, Black patients, Hispanic patients and patients with Medicaid insurance (who are typically of lower socioeconomic status), as well as intersectional subgroups. They note that, although examples of underdiagnosis of underserved patients have already been identified in several areas of clinical care, predictive models are likely to amplify this bias. In addition, the shift towards automated natural language processing (NLP)-based labeling, which is also known to show bias against under-represented populations, could contribute to differences in underdiagnosis among underserved groups. This study therefore sheds light on an important but relatively understudied type of bias in healthcare, and raises bigger questions of how such bias arises and how it can be minimized.

The authors make several recommendations to mitigate underdiagnosis through considerations in the AI development process. For example, they suggest that automatic labeling from radiology reports using NLP should be audited. They also note the tradeoffs between equity (through achieving equal false negative rates (FNR) and false positive rates (FPR)) and model performance. However, in asking the question of whether “worsening overall model performance on one subgroup in order to achieve equality is ethically desirable,” the authors also explicitly frame that tradeoff as one of values, as well as technical considerations.
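
As a simplified illustration of what such an audit could look like in practice, the sketch below computes false negative and false positive rates per demographic subgroup from a table of binary labels and predictions. The column names (y_true, y_pred, sex), the toy data and the structure of the check are assumptions made for illustration; they are not the authors' protocol or code.

```python
# Minimal sketch of a per-subgroup error-rate audit (illustrative only).
# Assumes a pandas DataFrame with binary ground truth (y_true), binary
# predictions (y_pred) and a demographic column such as 'sex'.
import pandas as pd


def subgroup_error_rates(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Return FNR and FPR for each value of `group_col`."""
    rows = []
    for group, g in df.groupby(group_col):
        pos = g[g["y_true"] == 1]  # patients who actually have the finding
        neg = g[g["y_true"] == 0]  # patients who do not
        rows.append({
            "group": group,
            "n": len(g),
            "FNR": (pos["y_pred"] == 0).mean() if len(pos) else float("nan"),
            "FPR": (neg["y_pred"] == 1).mean() if len(neg) else float("nan"),
        })
    return pd.DataFrame(rows)


# Toy usage; a real audit would use held-out clinical data and many subgroups,
# including intersectional ones.
toy = pd.DataFrame({
    "y_true": [1, 1, 0, 0, 1, 0, 1, 0],
    "y_pred": [1, 0, 0, 1, 0, 0, 1, 0],
    "sex":    ["F", "F", "F", "F", "M", "M", "M", "M"],
})
print(subgroup_error_rates(toy, "sex"))
# Large gaps in FNR between groups would point to an underdiagnosis-style disparity.
```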

Clinicians’ values are reflected in the choice of the binarized metrics FPR and FNR as opposed to area under the curve (AUC), which prioritizes the type of prediction that is most useful for clinical decision-making. For diagnostic tests, AUC is a single metric that represents, for example, the likelihood that the test will correctly rank a patient with a lesion and one without, across all diagnostic thresholds such as ‘benign’ or ‘definitely cancer’. However, it averages across thresholds, even those that are not clinically relevant, and is uninformative about relative sensitivity and specificity, treating them as equally important. The dangers of optimizing AI models for the wrong task by failing to recognize or take into account patient values are very real in the healthcare setting; for a patient, the implications of a false positive versus a false negative for a malignancy are not equivalent [2]. Human diagnosticians recognize the difference in misclassification costs, and ‘err on the side of caution’ [3]. However, performance metrics that do not account for real-world impacts and what is important to patients and clinicians will be misleading. In addition, the clinician’s need for information on causal inference in order to take action and the limitations of data-driven models to provide such information must be acknowledged [4].
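
To make the distinction concrete, the short sketch below contrasts a single AUC value with the FNR and FPR at one fixed operating threshold, using made-up scores and an assumed threshold of 0.5 (scikit-learn is an assumed dependency). It shows how a high AUC can coexist with a clinically meaningful miss rate at the threshold actually used for decisions.

```python
# Sketch: AUC summarizes ranking quality across all thresholds, while FNR and
# FPR describe errors at the specific threshold used for decisions. Toy values.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.55, 0.35, 0.4, 0.3, 0.2, 0.1])  # model outputs
threshold = 0.5  # assumed clinical operating point

auc = roc_auc_score(y_true, scores)          # averages over every threshold
y_pred = (scores >= threshold).astype(int)   # decisions actually made

fnr = ((y_true == 1) & (y_pred == 0)).sum() / (y_true == 1).sum()
fpr = ((y_true == 0) & (y_pred == 1)).sum() / (y_true == 0).sum()

print(f"AUC={auc:.2f}  FNR={fnr:.2f}  FPR={fpr:.2f}")  # AUC=0.94  FNR=0.25  FPR=0.00
# The single AUC of 0.94 says nothing about the 25% of true cases missed at the
# deployed threshold, which is the error a patient actually experiences.
```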

The example of the Epic Sepsis Model (ESM) highlights some of the implications of decisions made at different stages of development. The model is a tool included in Epic Systems’ electronic health record platform that predicts the probability of sepsis. The ESM has drawn criticism because of its poor performance in some health systems, which was characterized as “substantially worse” than what was reported by the developer (Epic Systems) [5]. However, the developer neither evaluated the product’s real-world performance nor tested it across demographic groups before release [6]. Furthermore, the model’s proprietary status makes it difficult for users to evaluate independently. Another critique is the ESM’s use of proxy variables such as ethnicity and marital status, a strategy with known risks [7] that require an explicit assessment for bias or confounding.
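
One commonly suggested way to make such an assessment explicit, sketched below with hypothetical feature and column names, is to test how well candidate proxy features predict a protected attribute; strong predictability flags a potential proxy that deserves closer review. This is a generic auditing heuristic assumed for illustration, not a description of how the ESM was built or evaluated.

```python
# Sketch of a proxy-variable check (hypothetical feature and column names):
# if routine features predict a protected attribute well above chance, they
# may be acting as proxies for it and warrant explicit review for bias.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def proxy_score(df: pd.DataFrame, candidate_cols: list, protected_col: str) -> float:
    """Cross-validated accuracy of predicting the protected attribute from the
    candidate proxy features; higher values indicate stronger proxy signal."""
    X = pd.get_dummies(df[candidate_cols], drop_first=True)
    y = df[protected_col]
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X, y, cv=5).mean()


# Hypothetical usage:
# score = proxy_score(ehr_df, ["marital_status", "zip_code"], "ethnicity")
# if score > 0.7:  # threshold is arbitrary; the point is to trigger human review
#     print("Candidate features strongly encode the protected attribute.")
```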

What influences the values driving AI design choices? The work of Seyyed-Kalantari et al. [1] reveals the importance of the healthcare context in understanding the implications of AI-driven decisions. Critical to contextual understanding is an awareness of known biases in healthcare practice and delivery. However, the potential for such understanding may be limited because the major players in the development of healthcare AI increasingly come from tech companies that lack healthcare expertise in key positions [8]. Collaboration among people with backgrounds in medicine, data sciences and engineering is crucial to the development of AI for healthcare, to bring together people with diverse professional responsibilities and value systems.

The influences of the professional norms of medical care and research, computer science and software engineering are thus in flux. The developer culture that evolves in healthcare AI, including its values, norms and practices, will be especially important given the lack of consensus around standards or a clear regulatory framework to guide or compel assessments of safety and efficacy. Will AI developers be up to the challenges of ensuring fairness and equity in AI development and implementing the recommendations of Seyyed-Kalantari et al. [1], such as robust auditing of deployed algorithms? How will professional norms of medical care and research interact with those of computer science and software engineering? Will AI development teams include people with deep and specific knowledge of the relevant clinical domains? What incentives are there for AI developers to move beyond reporting AUCs, to take clinical considerations into account in selection of performance metrics or to conduct fairness checks?

A commitment to addressing the underdiagnosis of underserved populations by AI-based models and adopting the recommendations of Seyyed-Kalantari et al. and others will require more than technical solutions and modifications to development and evaluation processes. First, we must acknowledge that bias is not simply a feature of data that can be eliminated; it is defined and shaped by much deeper social and organizational forces [9,10]. For example, the use of socially constructed and government-mandated categories such as ‘Hispanic’ and ‘Asian’ for data classification is known to obscure a multitude of important health disparities [11,12], which would then be perpetuated by AI models using these categories. A fundamental realignment of the professional norms of software development for healthcare applications that acknowledges developers’ responsibilities to patient health and welfare will be necessary. Values of speed, efficiency and cost control must not be prioritized over values of transparency, fairness and beneficence. Just as important to addressing bias, however, will be the identification of social and organizational factors that lead to inequity and injustice in data and AI modeling processes, and the widespread adoption of norms and practices that correct them.

 


Keywords: AI, bias, models, healthcare
