
Study shows OpenAI's AI models 'memorize' copyrighted content
What's the story
A recent study has indicated that OpenAI, the company behind ChatGPT, may have trained its AI models on copyrighted material, including books, codebases, and more.
The revelation comes as authors, programmers, and other rights-holders continue to sue OpenAI for using their works without permission.
The company has always defended its actions as fair use under US copyright law, but plaintiffs argue the law carves out no such exception for training data.
Detection technique
Study proposes new method to detect 'memorization'
The study, co-authored by researchers from the University of Washington, the University of Copenhagen, and Stanford, proposes a novel method for identifying training data "memorized" by models behind an API like OpenAI's.
The researchers examined several OpenAI models including GPT-4 and GPT-3.5 for signs of memorization.
They did this by removing high-surprisal words, ones that are statistically unlikely given their surrounding context, from snippets of fiction books and New York Times articles, and then having the models try to guess the missing words.
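In rough terms, the probe works like a fill-in-the-blank quiz run against the model's API. The sketch below is only an illustration of that idea, not the study's code; the model name, prompt wording, masking format, and example passage are assumptions.

# Minimal sketch of a fill-in-the-blank memorization probe.
# NOT the study's code: prompt wording, model name, and the masked
# passage are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def probe_masked_passage(passage_with_blank: str, model: str = "gpt-4") -> str:
    """Ask the model to fill in a single masked word in a passage."""
    prompt = (
        "The following passage has one word replaced by [MASK]. "
        "Reply with only the missing word.\n\n" + passage_with_blank
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# If the model reliably recovers unusual (high-surprisal) words, that is
# taken as evidence the passage was seen during training.
snippet = "I kept my eyes on the [MASK] horizon as the ship pulled away."
print(probe_masked_passage(snippet))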
Results
Findings reveal potential memorization of copyrighted material
The study found that when these high-surprisal words were omitted from text snippets, the models could often predict the missing words accurately, suggesting they had memorized the underlying text during training.
In particular, GPT-4 showed signs of memorizing portions of popular fiction books and excerpts from New York Times articles.
Abhilasha Ravichander, a University of Washington doctoral student and co-author of the study, said the findings shed light on the "contentious data" such models may have been trained on.
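For context on the findings above: a word's surprisal is its negative log-probability under a language model, so high-surprisal words are the ones hardest to guess from surrounding context alone. The sketch below shows one way such scores could be computed with a small open reference model; the choice of GPT-2 and the example sentence are assumptions, not details from the study.

# Sketch of scoring per-token surprisal with a small reference model.
# GPT-2 is used here purely for illustration; the study's scoring
# setup may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def token_surprisals(text):
    # Tokenize and run a single forward pass.
    enc = tokenizer(text, return_tensors="pt")
    input_ids = enc["input_ids"]
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probability of each token given the tokens before it.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    tokens = tokenizer.convert_ids_to_tokens(targets[0].tolist())
    # Surprisal = -log p(token | context), in nats.
    return list(zip(tokens, (-token_log_probs[0]).tolist()))

# The highest-surprisal tokens are the ones masked in the probe: they are
# hard to guess from context alone, so recovering them points to
# memorization rather than generic language modeling.
scores = token_surprisals("The captain stared at the cerulean horizon.")
print(sorted(scores, key=lambda pair: pair[1], reverse=True)[:3])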
Transparency demand
Call for greater data transparency in AI training
Ravichander said, "In order to have large language models that are trustworthy, we need to have models that we can probe and audit and examine scientifically."
She also stressed the need for more data transparency in the AI training ecosystem.
OpenAI has previously advocated for looser restrictions on developing models with copyrighted material.
The company has content licensing agreements in place and offers opt-out mechanisms that let copyright owners flag content they don't want used for training.