Meta’s AI Models Trained on Copyrighted Books? Court Filings Say Yes
Recently unsealed court documents have disclosed internal discussions at Meta regarding the use of copyrighted materials for training its artificial intelligence (AI) models, raising legal and ethical concerns.
Meta’s approach to acquiring training data came to light through the lawsuit Kadrey v. Meta, filed in the U.S. District Court for the Northern District of California. The company contends that its use of copyrighted books qualifies as “fair use,” a claim contested by authors including Sarah Silverman and Ta-Nehisi Coates.
According to court filings, internal chats among Meta employees reveal a willingness to use copyrighted books, even when the legality was questionable. During a February 2023 meeting, research engineer Xavier Martinet proposed moving forward with the project without formal permissions.
“[M]y opinion would be (in the line of ‘ask forgiveness, not for permission’): we try to acquire the books and escalate it to execs so they make the call,” Martinet stated in a chat, according to the filings. He further noted that the AI research division was established to take more risks.
Rather than securing licenses, Martinet suggested that the company buy e-books from ordinary retailers to build its dataset. He also reassured a colleague about the legal risk, noting that training on pirated texts without authorization was common among startups.
“I mean, worst case: we found out it is finally ok, while a gazillion start up [sic] just pirated tons of books on bittorrent,” he added.
Meta’s Legal Approach and Licensing Negotiations
According to the filings, Meta was in active discussions with Scribd and other platforms to work out licensing arrangements. Melanie Kambadur, a senior manager, indicated that Meta’s legal staff had become less restrictive about approving the use of “publicly available data.”
“Yeah we definitely need to get licenses or approvals on publicly available data still,” Kambadur noted in internal communications. “[D]ifference now is we have more money, more lawyers, more bizdev help, ability to fast track/escalate for speed, and lawyers are being a bit less conservative on approvals.”
Meta employees also looked into Libgen, a well-known website that hosts copyrighted books. Kambadur mentioned Libgen during the discussion, which prompted a colleague to share a screenshot containing the text “Libgen is not legal.”
Some employees nevertheless regarded Libgen as critical to competing effectively. According to a communication from Meta AI Vice President Joelle Pineau to Director of Product Management Sony Theakanath, using Libgen was essential to reaching state-of-the-art (SOTA) numbers in all categories.
To mitigate legal risks, Theakanath suggested filtering out obviously pirated material and avoiding public disclosure of Libgen’s use. “We would not disclose use of Libgen datasets used to train,” he wrote.
AI Model Adjustments to Avoid Copyright Issues
Meta adjusted its AI systems so that their responses would not reveal the source of the training data. Two changes stood out: the models stopped reproducing passages from copyrighted books, and they were restricted from answering questions about which e-books they had been trained on.
Meta also obtained training data from Reddit, according to the filings. In March 2024, Chaya Nayak, Meta’s Director of Product Management for Generative AI, recommended that the company evaluate Quora content alongside licensed books and scientific publications in order to gather additional data.
“[W]e need more data,” Nayak stated.
The plaintiffs claim that Meta cross-referenced pirated books with legally available ones to determine whether licensing agreements were necessary. With the case escalating, Meta has enlisted two Supreme Court litigators from Paul Weiss to aid its legal defense.
Meta has yet to respond to requests for comment regarding the revelations.
