WASHINGTON — It’s not Skynet yet, but in an Air Force experiment artificial intelligence tools outperformed human professionals at a key piece of military operations planning, service officials recently revealed.
In the service’s latest “DASH” experiment this past fall, the Air Force pitted AI tools from half a dozen companies against military personnel from the US, Canada and UK and asked each to solve hypothetical “battle management” problems, from standard Air Force tasks like planning an airstrike or rerouting aircraft whose home base had been damaged, to more obscure scenarios like gathering intelligence on an anomalous electromagnetic signal or protecting a disabled and drifting Navy vessel.
When the Air Force evaluated the resulting proposed Courses of Action (COAs), it found that at least one of the AI algorithms not only generated more COAs in less time than the humans, but also made fewer errors.
“These machine-generated recommendations were up to 90 percent faster than traditional methods, with the best in machine-class solutions showing 97 percent viability and tactical validity,” said Col. John Ohlund, director of the Air Force’s Advanced Battle Management System Cross-Functional Team (ABMS CFT), according to an official article published by the Air Force on Monday about DASH-3 (Decision Advantage Sprint for Human-Machine Teaming).
The article said that human-generated courses of action took about 19 minutes, with 48 percent of the options “being considered viable and tactically valid.”
“And our team didn’t observe hallucinations during the experiment,” Ohlund said.
That’s a stark contrast both to other recent Air Force tests, where experimental planning algorithms made subtle errors, and to broader concerns about generative AI being prone to producing grossly inaccurate “hallucinations.”
Now, there are plenty of caveats to that human-machine comparison, the colonel cautioned in a follow-up interview with Breaking Defense. The first was the sheer complexity and unfamiliarity of the scenario, which was deliberately designed to stress and stretch the human participants.
“We challenged the operators outside of their comfort zone for their normal training,” Ohlund explained. “We gave them multi-domain problems to solve [in] one hour.”
The reason for this approach? The overall project Ohlund is working on, ABMS, is meant to be a multi-service and “multi-domain” command system — that is, one for all the services to use for operations involving all the officially recognized “domains” of air, land, sea, space, and cyberspace.
But the personnel participating in the experiment, unsurprisingly, were mostly trained and experienced in Air Force operations in the air. One relatively junior participant, for instance, was an enlisted airman with two years’ training, who simply didn’t have experience with electronic warfare capabilities.
What’s more, the participants were used to working on actual command-and-control networks with real-life performance data. But those networks and that data are classified, meaning it might take months to get novel software certified as secure to use with them. So the experiment used unclassified approximations of the real thing. That allowed six different civilian companies to provide AI tools, but also added another layer of unfamiliarity for the military participants.
On the other hand, the AI tools obviously hadn’t completed Joint Professional Military Education courses or spent years working in joint operations centers, either. So the scenario and the data were unfamiliar to both the organic and the digital participants.
In fact, both humans and AI received the same pre-wargame briefing with “probably about 20 pages of documents to read before the experiment,” Ohlund said. That ranged from the broad guidance known as “commander’s intent” to spreadsheets full of stats on different missiles, jammers, and so on, including their probabilities of successfully defeating a given target. (Telling a generative AI to base its outputs on a clearly defined, human-verified dataset in this way is a well-attested technique for combating hallucinations.)
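For readers curious what that grounding technique looks like in practice, here is a minimal illustrative sketch: the model is handed a small, human-verified dataset inside its prompt and instructed to answer only from it. The weapon names, ranges, and kill probabilities below are invented for illustration and have nothing to do with the experiment’s actual briefing package.

```python
# Minimal sketch of "grounded" prompting: the model may answer only from a
# small, human-verified dataset supplied in the prompt. All names and values
# here are hypothetical, not from DASH-3.

VERIFIED_WEAPON_STATS = {
    # weapon_id: illustrative range and probability-of-kill figures
    "missile_a": {"range_km": 180, "p_kill": 0.72},
    "missile_b": {"range_km": 95, "p_kill": 0.88},
    "jammer_c": {"range_km": 40, "p_kill": None},  # support asset, no p_kill
}

def build_grounded_prompt(task: str) -> str:
    """Embed the verified dataset in the prompt and forbid outside knowledge."""
    lines = [
        "You are a planning assistant.",
        "Use ONLY the data below; if a value is missing, answer 'unknown'.",
        "",
        "VERIFIED DATA:",
    ]
    for name, stats in VERIFIED_WEAPON_STATS.items():
        lines.append(f"- {name}: range_km={stats['range_km']}, p_kill={stats['p_kill']}")
    lines += ["", f"TASK: {task}"]
    return "\n".join(lines)

if __name__ == "__main__":
    # The resulting prompt would then be sent to whatever generative model a team uses.
    print(build_grounded_prompt("Recommend a weapon for a target at 120 km."))
```

Because the model’s answer is constrained to the supplied, verified figures, there is far less room for it to invent capabilities that don’t exist, which is the essence of the anti-hallucination approach described above.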
The difference is that the humans often got overwhelmed by all the new information and forgot or misremembered it when crunch time came. The best AIs retained absolutely everything.
“The computer does not forget,” Ohlund said.
The computer also did best with a lot of human prep work. For the specific AI tool whose proposed Courses Of Action were 97 percent “viable and tactically valid,” the software development team had made sure their algorithm could actually digest the briefing package. Ohlund said, “that particular company had normalized all the data, had adjusted every bit of spreadsheets, had translated all the narratives.” In its report, the Air Force did not identify the companies involved in the experiment.
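The Air Force did not describe exactly how that company prepared its data, but the general idea of “normalizing” spreadsheets before feeding them to a model looks something like the sketch below: making units, formats, and identifiers consistent so the algorithm isn’t tripped up by messy inputs. The column names, units, and values are hypothetical.

```python
# Illustrative sketch of spreadsheet normalization before model ingestion.
# Column names, units, and values are made up, not from the experiment.
import pandas as pd

raw = pd.DataFrame({
    "Weapon": ["Missile A", "missile_b", "JAMMER C"],
    "Range": ["180 km", "95km", "25 mi"],   # mixed units and spacing
    "Pk": ["72%", "0.88", "n/a"],           # mixed probability formats
})

def to_km(value: str) -> float:
    """Convert a range string like '95km' or '25 mi' to kilometres."""
    v = value.lower().replace(" ", "")
    if v.endswith("mi"):
        return round(float(v[:-2]) * 1.609, 1)
    return float(v.removesuffix("km"))

def to_probability(value: str) -> float | None:
    """Convert '72%' or '0.88' to a 0-1 float; return None for missing data."""
    v = value.strip().lower()
    if v in {"n/a", ""}:
        return None
    return float(v.rstrip("%")) / 100 if v.endswith("%") else float(v)

clean = pd.DataFrame({
    "weapon_id": raw["Weapon"].str.lower().str.replace(" ", "_"),
    "range_km": raw["Range"].map(to_km),
    "p_kill": raw["Pk"].map(to_probability),
})
print(clean)
```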
The scenario was also designed to stress the humans out — a factor that didn’t affect the emotionless algorithms at all.
“We wanted the operators to have a realistic operating environment, meaning they’re not just sitting there waiting to solve a problem … . They have to be stressed,” Ohlund told Breaking Defense. “Time critical, you’re under pressure, you’ve got to make a decision right now. Formulate your plan, give me your plan, go go go.”
In other words, human beings do best when they’re doing familiar things in familiar environments, but AIs can ingest a firehose of novel information unfazed.
So what comes next? None of the six AI tools in the experiment is ready for operational use by planning staffs today, Ohlund said. None of this is a standalone replacement for human planners, either. Instead, he explained, they’re intended to evolve into “microservices” that plug into a larger military command-and-control system.
In fact, in the formalized, machine-friendly framework used by the ABMS team’s AI architecture, known as the Transformational Model, generating Courses Of Action (what the AI did in this experiment) is just one step out of 13 needed to develop an executable plan. Of course, more algorithmic microservices can be developed to handle the other 12 steps as well.
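Conceptually, that “microservice” framing means COA generation becomes one interchangeable stage in a longer pipeline rather than a standalone product. The sketch below is purely hypothetical: the source does not describe the Transformational Model’s interfaces or name its other steps, so the types and step names are invented to illustrate the plug-in idea.

```python
# Hypothetical sketch of COA generation as one pluggable step in a larger
# planning pipeline. Step names and interfaces are invented for illustration.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class PlanningContext:
    commanders_intent: str
    verified_data: dict
    candidate_coas: list[str] = field(default_factory=list)

def generate_coas(ctx: PlanningContext) -> PlanningContext:
    """The one step the DASH-3 tools performed: propose candidate COAs."""
    # In practice this would call a model service; here it just stubs output.
    ctx.candidate_coas = [f"COA drafted against intent: {ctx.commanders_intent}"]
    return ctx

# Other steps (validation, resourcing, approval, ...) would be separate
# services chained the same way.
PIPELINE: list[Callable[[PlanningContext], PlanningContext]] = [generate_coas]

if __name__ == "__main__":
    ctx = PlanningContext("Deny adversary use of the strait", verified_data={})
    for step in PIPELINE:
        ctx = step(ctx)
    print(ctx.candidate_coas)
```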
“I was skeptical about technology being integrated into decision-making,” said one participant, Lt. Ashley Nguyen, according to the official Air Force article. “But working with the tools, I saw how user-friendly and timesaving they could be. The AI didn’t replace us; it gave us a solid starting point to build from.”