OpenAI to disclose AI training data in copyright infringement lawsuit

By Akash Pandey

Sep 26, 2024

11:36 am

What's the story

OpenAI, a leading artificial intelligence (AI) developer, has agreed to disclose the training data used for its generative AI models.

This decision comes in response to a copyright infringement lawsuit filed by several authors including Paul Tremblay, Sarah Silverman, Michael Chabon, David Henry Hwang, and Ta-Nehisi Coates.

The plaintiffs allege that OpenAI's models were trained on their copyrighted books without permission.

Legal battle

Authors allege copyright violation

The authors claim that OpenAI's models reproduce their words, violating US copyright law and California's unfair competition rules.

Their individual lawsuits have been consolidated into a single claim.

This is not the first time OpenAI has faced such allegations: other plaintiffs made similar claims earlier this year.

Anthropic, another AI developer, was also sued on similar grounds.

Data disclosure

Court order outlines strict access conditions to OpenAI's data

US Magistrate Judge Robert Illman has issued an order outlining the protocols and conditions for the plaintiffs' attorneys to access OpenAI's training data.

The court order treats this data as sensitive source code or a proprietary business process, and sets strict rules for its examination.

The models behind ChatGPT (GPT-3.5, GPT-4, etc.) are believed to have relied heavily on publicly accessible data.

Examination protocol

Training data to be examined under strict conditions

According to Judge Illman's order, "Training data shall be made available by OpenAI in a secure room on a secured computer without internet access or network access to other unauthorized computers or devices."

The order also prohibits recording devices in the secure room and allows OpenAI's legal team to inspect any notes taken during the examination.

This level of secrecy is likely due to potential legal liability concerns.

Defense strategy

OpenAI maintains its use of copyrighted content is fair

Despite facing numerous copyright claims, OpenAI continues to assert that its use of copyrighted content qualifies as fair use and is legally defensible.

The company's attorneys argue that even if the plaintiffs' books were used to train its models, that use would qualify as transformative fair use.

They further contend that generative AI is about creating new content rather than reproducing training data.

Legal interpretation

Attorneys argue AI models create new content

OpenAI's legal team argues that its models do not infringe copyright because they merely extract statistical data during training.

They state, "The purpose of those models is not to output material that already exists; there are much less computationally intensive ways to do that."

Instead, they claim these models generate new material based on an understanding of language and reasoning.