The unlicensed reuse of YouTube videos as generative AI training material is a hot-button issue in the creator world, and a new report shows how deep that rabbit hole really goes. As part of an investigation titled AI Watchdog, The Atlantic explored large AI training data sets and published a tool that lets users search those sets for specific creators and channels.
Videos that are repurposed for AI training typically have identifying details removed, but Atlantic reporter Alex Reisner found a workaround by “extracting unique identifiers from the data sets and looking them up on YouTube.” From there, he was able to trace the content in these sets back to its sources, a list that includes creators like Jon Peters.
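Reisner doesn’t publish his exact pipeline, but the core of the technique is easy to sketch. Assuming the “unique identifiers” are YouTube’s standard 11-character video IDs, a short script can rebuild each watch URL and ask YouTube’s public oEmbed endpoint (which requires no API key) for the video’s title and channel. The lookup_video helper below is illustrative, not The Atlantic’s code:

```python
import requests

OEMBED_URL = "https://www.youtube.com/oembed"

def lookup_video(video_id: str):
    """Map a YouTube video ID back to its source video and channel.

    Uses YouTube's public oEmbed endpoint, which requires no API key.
    Returns None when the video is deleted, private, or the ID is invalid.
    """
    watch_url = f"https://www.youtube.com/watch?v={video_id}"
    resp = requests.get(
        OEMBED_URL,
        params={"url": watch_url, "format": "json"},
        timeout=10,
    )
    if resp.status_code != 200:
        return None
    data = resp.json()
    return {
        "video_id": video_id,
        "title": data.get("title"),
        "channel": data.get("author_name"),
        "channel_url": data.get("author_url"),
    }

# Example with a well-known public video ID standing in for one
# extracted from a training data set.
print(lookup_video("dQw4w9WgXcQ"))
```

A non-200 response from the endpoint typically means the video has since been deleted or made private, which is itself a useful signal when auditing a data set.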
“A large number of the videos in these data sets are from news and educational channels, such as the BBC (which has at least 33,000 videos in the data sets, across its various brands) and TED (nearly 50,000),” Reisner wrote. “Hundreds of thousands of others—if not more—are from individual creators, such as Peters.”
The report will undoubtedly cause further alarm among creators already up in arms over the questionable ethics of AI training sets. Ripping YouTube videos en masse violates the platform’s terms of service, but the presence of creator channels in these data sets shows that YouTube has more work to do if it wants to fully enforce that portion of its rules.
Once the data sets reach tech companies, culpability is hard to pin down. Since the Nvidias and Metas of the world don’t rip the training videos off YouTube themselves, they have plausible deniability when it comes to third-party sets. Spokespeople from Nvidia, Meta, and Amazon who responded to The Atlantic’s request for comment expressed their shared belief that the YouTube-based training they do is completely above board.
Those statements may leave some creators feeling helpless to defend ownership of their work. There are, however, steps creators can take to prevent their videos from being repurposed for AI training. Adding overlays, such as watermarks or captions, makes videos less appealing to genAI developers, who don’t want those features showing up in the content their models generate. Platforms like YouTube also let creators opt out of having their videos made available for training, though based on the Atlantic report, it’s fair to question how effective that toggle is.
The courts could provide another layer of defense against unauthorized AI training, but creator-led lawsuits — such as David Millette’s case against Nvidia — have already been dismissed. If judges do end up siding with AI developers, creators may be left with little recourse beyond the preventative measures described above.