Did GPT-4 Memorize Copyrighted Books? New Study Sparks Legal and Ethical Concerns

OpenAI is once again at the center of controversy as a new academic study reveals its AI models, including GPT-4 and GPT-3.5, may have memorized and reproduced copyrighted content during training.

Researchers from the University of Washington, Stanford University, and the University of Copenhagen unveiled a novel method to detect whether large language models (LLMs) like GPT-4 have internalized specific pieces of text from their training data.

The results suggest that the models under investigation retained memorised content from well-known works of fiction and from media outlets such as the New York Times.

The research team focused on “high-surprisal” words: words that are statistically unlikely given their surrounding context and therefore serve as useful markers of memorised text. Using a masking technique, the researchers removed these words from text fragments and tested whether the models could predict them.

The approach works as a memorisation probe because a model can only reliably fill in such unexpected words if it has previously seen the exact phrasing during training.

The results showed that OpenAI’s GPT-4 frequently reconstructed the masked high-surprisal words in passages from BookMIA, a dataset of copyrighted e-books. The model was less successful at recalling news content, though there was still evidence that it had memorised New York Times material to some extent.
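The paper’s exact pipeline is not reproduced here, but the core masking logic described above can be sketched in a few lines. Everything in this snippet is illustrative: the per-word probabilities, the toy passage, and the function names are assumptions, and the “model” is a stand-in callable rather than a real LLM.

```python
import math

def surprisal(word_prob: float) -> float:
    """Surprisal in bits: the rarer a word is in context, the higher the score."""
    return -math.log2(word_prob)

def mask_highest_surprisal(words, probs):
    """Mask the single most surprising word in a passage.

    Returns the masked word list and the held-out target word.
    """
    target_idx = max(range(len(words)), key=lambda i: surprisal(probs[i]))
    masked = list(words)
    target = masked[target_idx]
    masked[target_idx] = "[MASK]"
    return masked, target

def memorization_hit(predict, words, probs) -> bool:
    """Ask a model to fill the mask; an exact match on a high-surprisal
    word is taken as evidence the model has seen the passage verbatim."""
    masked, target = mask_highest_surprisal(words, probs)
    return predict(masked) == target

# Toy passage with hypothetical per-word probabilities from a reference
# language model (lower probability = higher surprisal).
words = ["the", "clocks", "were", "striking", "thirteen"]
probs = [0.60, 0.05, 0.40, 0.10, 0.001]  # "thirteen" is most surprising

# Dummy "model" that has memorised the passage verbatim.
memorized_model = lambda masked: "thirteen"
print(memorization_hit(memorized_model, words, probs))  # True
```

A model that had never seen the passage would have to guess the masked word from context alone, and unexpected words are exactly the ones context does not predict, which is what makes a correct fill-in informative.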

“This work highlights the need for deeper transparency in how training data is sourced and handled,” said Abhilasha Ravichander, a University of Washington Ph.D. student and co-author of the study. “We need scientific tools to audit and examine the behavior of these models. Without that, we’re flying blind in terms of trust and accountability.”

OpenAI has recently been sued by a number of authors, developers, and content creators over the unauthorised use of their intellectual property. OpenAI defends its data practices under the “fair use” doctrine; critics counter that U.S. copyright law does not explicitly permit large-scale model training on copyrighted material.

OpenAI has established licensing agreements with some publishers for content usage and allows publishers to opt out of having their material used. At the same time, the company continues to advocate for legislative reforms that would grant AI developers broader access to copyrighted content, a stance that has sparked significant debate among legal experts and creative professionals.