Facilitating AMIE’s physician-centered oversight

Towards conversational diagnostic artificial intelligence | Nature

Our research AI system for medical reasoning and diagnostic dialogue, Articulate Medical Intelligence Explorer (AMIE), was recently shown to provide accurate medical advice in text-based simulations of patient visits. However, individual patient diagnoses and treatment plans are regulated activities that must be reviewed and approved by licensed medical professionals before they are communicated to patients. Oversight is an established medical paradigm that gives care team members autonomy while the overseeing primary care physicians (PCPs) retain accountability for the patient’s care. In light of this, our current study investigates a framework for physician oversight of AMIE. In “Towards physician-centered oversight of conversational diagnostic AI,” we present guardrailed-AMIE (g-AMIE), an extension of our AMIE research system with a multi-agent setup based on Gemini 2.0 Flash. g-AMIE gathers patient information through dialogue (i.e., history taking) and generates a body of information for a clinician to review.

This body of information includes a summary of the collected data, a proposed differential diagnosis and management plan, and a draft message to the patient. g-AMIE is designed with guardrail constraints that prevent it from sharing any individualized medical advice, such as a patient-specific diagnosis or treatment plan. A specialized web interface, the clinician cockpit, allows an overseeing PCP to review and edit this information. Because history taking and medical decision-making are decoupled, the overseeing PCP can review cases asynchronously. In a randomized, blinded, virtual objective structured clinical examination (OSCE), we compared g-AMIE’s performance with that of nurse practitioners (NPs), physician assistants/associates (PAs), and PCPs operating under the same guardrail constraints. We found that g-AMIE’s diagnostic performance and management plans were preferred by overseeing PCPs and independent physician raters. Additionally, patient actors favored g-AMIE’s patient messages. Although this is a significant step toward human–AI collaboration with AMIE, the results should be interpreted carefully, especially the comparisons with clinicians: the workflow was designed with AI systems in mind, and the clinicians had not been trained to work within this framework.

An oversight cockpit for clinicians

Using our clinician cockpit interface, which we developed in a co-design study with ten outpatient physicians, g-AMIE generates a comprehensive medical note to facilitate physician oversight. The co-design was conducted through semi-structured interviews with potential users and thematic analysis to identify crucial components before results were shared with a UI designer to draft the interface. The cockpit is based on the widely used SOAP note format, which includes subsections for Subjective (the patient’s perspective on their condition), Objective (observable and measurable patient data, such as vital signs or lab data), Assessment (differential diagnosis with justification), and Plan (management strategy).

Preparing medical records and taking notes

We developed a multi-agent system with a dialogue agent, a guardrail agent, and a SOAP note agent so that g-AMIE respects its guardrails while taking history and produces high-quality, accurate SOAP notes.

The objective of the dialogue agent is to carry out high-quality history taking in three phases: general history taking, targeted validation of an initial differential diagnosis, and a conclusion phase that addresses questions from the patient. The guardrail agent ensures that the dialogue agent’s responses do not contain any individualized medical advice, rephrasing responses as necessary. The SOAP note agent performs sequential multi-step generation, separating the summarization tasks (Subjective and Objective) from the inferential tasks (Assessment and Plan) and from the generation of patient messages.
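To make the division of labor concrete, here is a minimal Python sketch of the guardrail step and the sequential SOAP note generation. It is an illustration under stated assumptions, not g-AMIE’s implementation: the `Model` callable, the prompt tags, the `is_individualized_advice` check, and the choice to condition Assessment and Plan on the earlier summaries are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List

# Hypothetical stand-in for a language model call: prompt text in, reply text out.
Model = Callable[[str], str]


class Phase(Enum):
    """The three history-taking phases named in the text."""
    GENERAL_HISTORY = "general history taking"
    TARGETED_VALIDATION = "targeted validation of an initial differential diagnosis"
    CONCLUSION = "conclusion, addressing the patient's questions"


@dataclass
class SoapNote:
    subjective: str = ""
    objective: str = ""
    assessment: str = ""
    plan: str = ""
    patient_message: str = ""


def guarded_reply(draft: str,
                  is_individualized_advice: Callable[[str], bool],
                  rephrase: Model) -> str:
    """Guardrail step: pass the draft through, rephrasing only when it is flagged."""
    return rephrase(draft) if is_individualized_advice(draft) else draft


def write_soap_note(transcript: List[str], model: Model) -> SoapNote:
    """Sequential generation: summaries first, then inference, then the message."""
    dialogue = "\n".join(transcript)
    note = SoapNote()
    # Step 1: summarization tasks, drawn from the transcript alone.
    note.subjective = model("SUBJECTIVE\n" + dialogue)
    note.objective = model("OBJECTIVE\n" + dialogue)
    # Step 2: inferential tasks, conditioned on the summaries (an assumption).
    summary = note.subjective + "\n" + note.objective
    note.assessment = model("ASSESSMENT\n" + summary)
    note.plan = model("PLAN\n" + summary + "\n" + note.assessment)
    # Step 3: draft patient message, held for clinician review.
    note.patient_message = model("PATIENT_MESSAGE\n" + note.assessment + "\n" + note.plan)
    return note
```

Keeping the summarization and inferential steps as separate calls means each section of the note can be reviewed or regenerated on its own, which is one plausible reason to prefer sequential generation over a single prompt.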

Randomized OSCE study

To contextualize g-AMIE’s performance, we conducted a virtual OSCE study across 60 case scenarios developed by an OSCE laboratory. The study involved patient actors, g-AMIE, and two control groups: NP/PAs, and PCPs with less than five years of independent practice experience. Both control groups were instructed to operate under the same guardrail constraints as g-AMIE and to draft SOAP notes and patient messages after their history taking. We refer to these control groups as guardrail PCPs and guardrail NP/PAs. In a second step, a group of overseeing PCPs with at least five years of experience, including experience supervising healthcare team members, reviewed the SOAP notes, made edits, and either approved the (edited) patient message or decided that a follow-up consultation would be needed. Finally, independent physicians rated the overall quality of the transcript, the SOAP notes, and the overseeing PCP’s decision. Patient actors, overseeing PCPs, and independent raters were blinded as to whether history taking had been performed by g-AMIE or a control group. We built on previous evaluation rubrics by assessing the quality of SOAP notes with a reworked QNote rubric and by adding oversight-specific rubrics.
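The oversight step described above has two outcomes: the overseeing PCP either approves the (possibly edited) patient message or calls for a follow-up consultation. The following Python sketch models that decision. The names `Decision`, `Review`, and `message_to_send` are hypothetical and do not come from the study or the clinician cockpit.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    APPROVE_MESSAGE = "approve"   # the patient message may be sent
    FOLLOW_UP = "follow_up"       # a follow-up consultation is needed instead


@dataclass
class Review:
    decision: Decision
    edited_message: Optional[str] = None  # None means the draft was left unchanged


def message_to_send(draft: str, review: Review) -> Optional[str]:
    """Return the message cleared for the patient, or None if nothing is sent."""
    if review.decision is Decision.FOLLOW_UP:
        return None
    # Any edit made by the overseeing PCP takes precedence over the draft.
    return review.edited_message if review.edited_message is not None else draft
```

In this sketch no text reaches the patient unless a review with an approving decision exists, which mirrors the requirement that a licensed clinician signs off before any patient communication.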

Evaluation

History taking, SOAP notes, and oversight experience

We found that g-AMIE successfully took patients’ medical histories within its guardrails, and that no consultation contained individualized medical advice. In addition, g-AMIE’s ability to obtain essential patient information was rated higher than that of the control groups, and its SOAP notes were judged to be more comprehensive, accurate, and readable. Overseeing PCPs preferred overseeing g-AMIE and were more likely to accept patient messages crafted by g-AMIE.