Across the country, educators are growing increasingly concerned about the impact of ChatGPT and other AI chatbots on students’ learning — and with good reason.

A recent survey by Inside Higher Ed found that a staggering 85% of college students have used generative AI for coursework in the past year — with a quarter using it to complete assignments for them, and 19% to write full essays.

“We have had some examples of students using ChatGPT, using Gemini, using things and then presenting that work as their own work,” said Amy Mount, the director of curriculum and instruction at Gateway Regional School District of New Jersey.

With so many AI platforms available, guarding against their use is difficult.

“Even if I block ChatGPT, they’ll go to Claude. If I block Claude, they’ll go to Gemini. If I go — right? There’s no way to block every single piece of generative AI,” said Mount.

The widespread use of AI has created a dilemma for teachers and professors: how do you tell the difference between assignments written by students and those written by a bot?

It’s a thorny question with no easy solutions.

“When I first started researching into it, and trying it out myself, there was somebody coming out with a way to keep kids from cheating with it,” said Kathleen Bially, a media specialist at Gateway Regional High School. “And then they realized that there were so many roadblocks that there is no way to ever know fully if a student is cheating with AI.”

That hasn’t stopped educators from trying to find ways of identifying AI-produced work, ranging from AI detectors, to scanning essays for common hallmarks of AI writing, to following their gut instinct that something is off with an assignment. But how accurate are these methods? And are they reliable enough to accuse students of plagiarism, potentially leading to disciplinary action or even expulsion?

How accurate are AI detectors?

These are questions that Chris Callison-Burch, a professor of Computer and Information Science at the University of Pennsylvania and director of the school’s new online AI master’s program, has been exploring, along with his PhD student, Liam Dugan.

Last year, the duo released a study designed to investigate how accurate commercial AI detectors — which are themselves powered by machine learning — really are at identifying text produced by generative AI.

While many of the companies producing these detectors have boasted accuracy rates of 99%, Callison-Burch and Dugan had their doubts.

“They would only evaluate their detectors on their own generated data sets,” Dugan said. “So they didn’t have a publicly available set of machine-generated text that everyone was evaluating on.”

And there were other reasons to be suspicious.

“A lot of counter opinions were being voiced — like OpenAI said they had given up on their own internal efforts at being able to spot AI-generated text because the path was too hard,” Callison-Burch said. “It was too hard to train a system to spot AI-generated text.”

To assess the detectors’ true accuracy, Callison-Burch, Dugan, and their co-authors tested them using RAID (Robust AI Detector), a dataset they created that contains over 10 million documents from different sources, both human and AI. The texts span multiple genres and include the work of nearly a dozen large language models, along with “adversarial attacks,” or methods designed to fool AI detectors.
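To get a sense of what this kind of benchmarking involves, the sketch below shows the basic logic in simplified form: run a detector over a labeled mix of human- and machine-written texts and count how often it gets the label right. This is not the researchers’ actual code, and the `detector_score` function is a hypothetical stand-in for whatever commercial or open-source detector is being tested.

```python
# Simplified sketch of benchmarking an AI-text detector on labeled data.
# detector_score() is a hypothetical placeholder, not a real detector API;
# RAID itself spans millions of documents, many genres, and adversarial edits.

def detector_score(text: str) -> float:
    """Placeholder: return a probability that `text` is AI-generated.
    A real benchmark would call an actual detector here."""
    return 0.5  # stand-in value

def evaluate(samples: list[tuple[str, bool]], threshold: float = 0.5) -> dict:
    """samples: (text, is_ai) pairs. Reports overall accuracy and the
    false-positive rate, i.e. how often human writing is wrongly flagged."""
    correct = 0
    false_positives = 0
    human_total = 0
    for text, is_ai in samples:
        flagged = detector_score(text) >= threshold
        correct += int(flagged == is_ai)
        if not is_ai:
            human_total += 1
            false_positives += int(flagged)
    return {
        "accuracy": correct / len(samples),
        "false_positive_rate": false_positives / max(human_total, 1),
    }

if __name__ == "__main__":
    toy_data = [
        ("An essay a student actually wrote...", False),
        ("An essay produced by a chatbot...", True),
    ]
    print(evaluate(toy_data))
```

The key point the researchers raise is that the labeled texts should come from a shared, public dataset like RAID rather than from examples the detector’s maker generated itself, so that every detector is measured against the same yardstick.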