Microsoft promises to revolutionize medical diagnostics

An artificial intelligence platform developed by Microsoft claims to diagnose diseases with 85.5% accuracy, about four times the rate achieved by human doctors in the same trial.

The system, called the AI Diagnostic Orchestrator, not only analyzes clinical data but also simulates a debate among virtual agents that reason like real doctors.

This breakthrough raises unsettling questions: Will AI become the new oracle of medicine? Or just another actor in a scenario where human judgment remains essential?

“AI models are becoming dramatically better than humans”

By: Gabriel E. Levy B.

Diagnostic errors are one of the leading causes of preventable deaths in health systems.

According to a study published in BMJ Quality & Safety, about 5% of adults in the United States receive a medical misdiagnosis each year, which equates to 12 million people.

Of these, a third suffer serious consequences. In this context, the interest in incorporating artificial intelligence (AI) technologies is not a futuristic whim, but a concrete necessity.

Microsoft, one of the most active companies in the race to lead AI, presented its proposal to change the course of medical diagnostics: the AI Diagnostic Orchestrator (MAI-DxO).

The system was developed under Mustafa Suleyman, co-founder of DeepMind, who now leads Microsoft's artificial intelligence division.

The proposal brings together several language models, specifically five AI agents, to jointly analyze clinical cases and reach a consensus diagnosis.

Unlike previous tools, which worked in a unidirectional way, this model introduces debate and contradiction between algorithms.

The trial was carried out with 304 real cases extracted from the New England Journal of Medicine, one of the most prestigious publications in the scientific field. The AI obtained 85.5% accuracy in its diagnoses, especially when using OpenAI’s GPT-4 model.

In contrast, a group of human doctors, deprived of their usual complementary resources such as databases or imaging, got it right in only 20% of the cases. Although the design of the experiment drew some criticism, the difference in performance was too wide to ignore.

“The future does not depend on the model, but on the orchestrator”

The key to the MAI-DxO system lies not only in the use of large language models (LLMs), but in their collaborative design.

As Suleyman explained to the Financial Times, "AI models tend to become commodities; what really makes the difference is the added value of the orchestrator."

This statement sums up the approach Microsoft wants to establish for the medicine of the future: it is not just about having a powerful AI, but about organizing it as a symphony of diverse clinical reasoning.

From a technical point of view, LLMs such as GPT-4 can interpret symptoms, compare medical histories, and generate diagnostic hypotheses at a speed impossible for humans. But the real qualitative leap of Microsoft's system lies in allowing these agents to challenge one another, as specialists on a medical board would.

This reduces the individual bias of any single model and simulates a richer deliberation, closer to the thinking of a clinical team.
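Microsoft has not published MAI-DxO's internals, so as a loose illustration only, the final consensus step of such a multi-agent setup can be sketched as a majority vote among independent agent opinions (the function name, agent count, and example diagnoses here are hypothetical, not from the actual system):

```python
from collections import Counter

def consensus_diagnosis(agent_opinions):
    """Pick the diagnosis most agents converge on.

    agent_opinions: list of diagnosis strings, one per agent.
    Returns (diagnosis, share_of_agents_agreeing).
    """
    votes = Counter(agent_opinions)
    diagnosis, count = votes.most_common(1)[0]
    return diagnosis, count / len(agent_opinions)

# Five hypothetical agents weigh in on the same clinical case.
opinions = ["appendicitis", "appendicitis", "gastroenteritis",
            "appendicitis", "ovarian torsion"]
print(consensus_diagnosis(opinions))  # ('appendicitis', 0.6)
```

In the real system the agents reportedly debate and revise their reasoning before converging, which is far richer than a single vote; the sketch only captures the idea that aggregating several independent judgments can dampen the bias of any one model.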

For now, the company has not announced a specific commercial application, but there is speculation that it could be integrated into platforms such as Bing or Copilot, Microsoft’s conversational interfaces.

This would open the door to AI accessible to professionals and patients alike, although it would also raise ethical and regulatory dilemmas that have not yet been resolved.

Who is responsible if the automated diagnosis is wrong?

How do you ensure the privacy of clinical data?

What about the doctor-patient relationship?

Beyond performance, the system pursues an underlying economic objective: reducing waste in the health system.

In the United States, about 25% of health spending, more than $800 billion annually, goes to unnecessary or poorly indicated procedures.

If AI could improve diagnostic accuracy, it could also optimize the distribution of resources and avoid medical interventions that do not benefit the patient.

“Faster, cheaper and four times more accurate”

Suleyman's most provocative claim, that AI is "four times more accurate than humans," sparked a wave of reactions in the medical field. Some consider it a symptom of technological arrogance; others see it as an opportunity to rethink the role of the health professional in the digital age.

In any case, the comparison is unavoidable.

For years, authors such as Eric Topol, cardiologist and author of Deep Medicine, have argued that medicine is at a crossroads between humanism and automation.

For him, the ideal future is not one in which machines replace doctors, but in which they free them from repetitive tasks and bring them back into human contact. “We don’t need AI to replace us, we need it to allow us to be more human,” he wrote in 2019.

David Sontag, an MIT researcher and specialist in data science applied to medicine, offered a more pragmatic criticism: the doctors who participated in the study did not have the tools they would normally use in daily practice.

This, in his opinion, distorts the comparison and reduces the external validity of the results.

However, he acknowledged that the level of clinical demand of the test was higher than that of other similar trials.

Another point to consider is the risk of blind trust in the models. As sociologist Shoshana Zuboff, author of The Age of Surveillance Capitalism, warned, automated decisions are not without bias and errors, and the more opaque the functioning of the system, the greater the risk of uncritical dependence.

In medicine, where incorrect interpretation can cost lives, this warning becomes crucial.

Cases that illustrate the promise… and the dilemma

In 2023, Stanford University Hospital tested an AI system similar to Microsoft’s in its emergency department.

The tool was able to diagnose acute appendicitis with 91% accuracy, compared to 75% of resident physicians.

The implementation made it possible to reduce the average time of care from 3.5 hours to 2 hours, according to internal data from the center.

However, cases were also reported in which the AI suggested misdiagnoses, such as mistaking pancreatitis for a complicated urinary tract infection.

In China, Shanghai’s Ruijin Hospital implemented an LLM-based virtual medical assistant for the early detection of lung diseases.

The system, integrated with CT imaging and clinical data, identified precancerous lesions with a sensitivity rate greater than 88%.

This allowed earlier intervention in several patients, avoiding fatal progressions.

However, the model showed a lower performance when applied in other regions of the country, with different genetic and epidemiological profiles.

In Brazil, meanwhile, a pilot project by the Ministry of Health used a conversational AI model to assist rural doctors with basic diagnoses.

In areas where there is only one doctor for every 10,000 inhabitants, the tool offered a notable improvement in response times.

But a report by the Public Health Observatory warned that the quality of the recommendations decreased significantly when the internet connection was unstable, which shows the fragility of the infrastructure.

These cases show that the success of artificial intelligence in medicine depends not only on the model, but on the ecosystem that surrounds it: connectivity, training, regulation, and clinical culture.

And above all, how collaboration between humans and machines is articulated.

In conclusion

The promise of an AI capable of diagnosing better than a doctor is no longer science fiction. However, its real impact will depend on more than just precision figures: it will be necessary to design environments where technology enhances, rather than supplants, clinical judgment.

The medicine of the future will be neither fully human nor completely artificial, but an alliance between the best of both worlds.