It seems inevitable that the future of medicine will involve A.I., and medical schools are already encouraging students to use large language models. “I’m worried these tools will erode my ability to make an independent diagnosis,” Benjamin Popokh, a medical student at the University of Texas Southwestern, told me. Popokh decided to become a doctor after a twelve-year-old cousin died of a brain tumor. On a recent rotation, his professors asked his class to work through a case using A.I. tools such as ChatGPT and OpenEvidence, an increasingly popular medical L.L.M. that provides free access to health-care professionals. Each chatbot correctly diagnosed a blood clot in the lungs. “There was no control group,” Popokh said, meaning that none of the students worked through the case unassisted. For a time, Popokh found himself using A.I. after virtually every patient encounter. “I started to feel dirty presenting my thoughts to attending physicians, knowing they were actually the A.I.’s thoughts,” he told me. One day, as he left the hospital, he had an unsettling realization: he hadn’t thought about a single patient independently that day. He decided that, from then on, he would force himself to settle on a diagnosis before consulting artificial intelligence. “I went to medical school to become a real, capital-‘D’ doctor,” he told me. “If all you do is plug symptoms into an A.I., are you still a doctor, or are you just slightly better at prompting A.I. than your patients?”
A few weeks after the CaBot demonstration, Manrai gave me access to the model. It was trained on C.P.C.s from The New England Journal of Medicine; I first tested it on cases from the JAMA network, a family of leading medical journals. It made accurate diagnoses of patients with a variety of conditions, including rashes, lumps, growths, and muscle loss, with a small number of exceptions: it mistook one type of tumor for another and misdiagnosed a viral mouth ulcer as cancer. (ChatGPT, in comparison, misdiagnosed about half the cases I gave it, mistaking cancer for an infection and an allergic reaction for an autoimmune condition.) Real patients do not present as carefully curated case studies, however, and I wanted to see how CaBot would respond to the kinds of situations that doctors actually encounter.
I gave CaBot the broad strokes of what Matthew Williams had experienced: bike ride, dinner, abdominal pain, vomiting, two emergency-department visits. I didn’t organize the information in the way that a doctor would. Alarmingly, when CaBot generated one of its crisp presentations, the slides were full of made-up lab values, vital signs, and exam findings. “Abdomen looks distended up top,” the A.I. said, incorrectly. “When you rock him gently, you hear that classic succussion splash—liquid sloshing in a closed container.” CaBot even conjured up a report of a CT scan that supposedly showed Williams’s bloated stomach. It arrived at a mistaken diagnosis of gastric volvulus: a twisting of the stomach, not the bowel.
I tried giving CaBot a formal summary of Williams’s second emergency visit, as detailed by the doctors who saw him, and this produced a very different result—presumably because they had more data, sorted by salience. The patient’s hemoglobin level had plummeted; his white cells, or leukocytes, had multiplied; he was doubled over in pain. This time, CaBot latched on to the pertinent data and did not seem to make anything up. “Strangulation indicators—constant pain, leukocytosis, dropping hemoglobin—are all flashing at us,” it said. CaBot diagnosed an obstruction in the small intestine, possibly owing to volvulus or a hernia. “Get surgery involved early,” it said. Technically, CaBot was slightly off the mark: Williams’s problem arose in the large, not the small, intestine. But the next steps would have been virtually identical. A surgeon would have found the intestinal knot.
Talking to CaBot was both empowering and unnerving. I felt as though I could now receive a second opinion, in any specialty, anytime I wanted. But only with vigilance and medical training could I take full advantage of its abilities—and detect its mistakes. A.I. models can sound like Ph.D.s, even while making grade-school errors in judgment. Chatbots can’t examine patients, and they’re known to struggle with open-ended queries. Their output gets better when you emphasize what’s most important, but most people aren’t trained to sort symptoms in that way. A person with chest pain might be experiencing acid reflux, inflammation, or a heart attack; a doctor would ask whether the pain happens when they eat, when they walk, or when they’re lying in bed. If the person leans forward, does the pain worsen or lessen? Sometimes we listen for phrases that dramatically increase the odds of a particular condition. “Worst headache of my life” may mean brain hemorrhage; “curtain over my eye” suggests a retinal-artery blockage. The difference between A.I. and earlier diagnostic technologies is like the difference between a power saw and a hacksaw. But a user who’s not careful could cut off a finger.
Attend enough clinicopathological conferences, or watch enough episodes of “House,” and every medical case starts to sound like a mystery to be solved. Lisa Sanders, the doctor at the center of the Times Magazine column and Netflix series “Diagnosis,” has compared her work to that of Sherlock Holmes. But the daily practice of medicine is often far more routine and repetitive. On a rotation at a V.A. hospital during my training, for example, I felt less like Sherlock than like Sisyphus. Virtually every patient, it seemed, presented with some combination of emphysema, heart failure, diabetes, chronic kidney disease, and high blood pressure. I became acquainted with a new phrase—“likely multifactorial,” which meant that there were several explanations for what the patient was experiencing—and I looked for ways to address one condition without exacerbating another. (Draining fluid to relieve an overloaded heart, for example, can easily dehydrate the kidneys.) Sometimes a precise diagnosis was beside the point; a patient might come in with shortness of breath and low oxygen levels and be treated for chronic obstructive pulmonary disease, heart failure, and pneumonia. Sometimes we never figured out which had caused a given episode—yet we could help the patient feel better and send him home. Asking an A.I. to diagnose him would not have offered us much clarity; in practice, there was no neat and satisfying solution.
Tasking an A.I. with solving a medical case makes the mistake of “starting with the end,” according to Gurpreet Dhaliwal, a physician at the University of California, San Francisco, whom the Times once described as “one of the most skillful clinical diagnosticians in practice.” In Dhaliwal’s view, doctors are better off asking A.I. for help with “wayfinding”: instead of asking what sickened a patient, a doctor could ask a model to identify trends in the patient’s trajectory, along with important details that the doctor might have missed. The model would not give the doctor orders to follow; instead, it might alert her to a recent study, propose a helpful blood test, or unearth a lab result in a decades-old medical record. Dhaliwal’s vision for medical A.I. recognizes the difference between diagnosing people and competently caring for them. “Just because you have a Japanese-English dictionary in your desk doesn’t mean you’re fluent in Japanese,” he told me.
CaBot remains experimental, but other A.I. tools are already shaping patient care. ChatGPT is blocked on my hospital’s network, but many of my colleagues and I use OpenEvidence. The platform has licensing agreements with top medical journals and says it complies with the patient-privacy law HIPAA. Each of its answers cites a set of peer-reviewed articles, sometimes including an exact figure or a verbatim quote from a relevant paper, to prevent hallucinations. When I gave OpenEvidence a recent case, it didn’t immediately try to solve the mystery but, rather, asked me a series of clarifying questions.