Authors file copyright infringement lawsuit against OpenAI, claiming books were used as training material for ChatGPT
The writers behind books like “The Cabin at the End of the World” and “13 Ways of Looking at a Fat Girl” have filed a copyright infringement lawsuit against OpenAI, claiming software engineers copied massive amounts of text from the novels as training material for ChatGPT.
The authors Paul Tremblay and Mona Awad filed the class-action complaint in the U.S. District Court for the Northern District of California against OpenAI and its holding companies, alleging direct copyright infringement, vicarious copyright infringement, violations of section 1202(b) of the Digital Millennium Copyright Act, unjust enrichment, violations of California and common-law unfair competition laws, and negligence.
The Joseph Saveri Law Firm in San Francisco and Matthew Butterick, an attorney based in Los Angeles, filed the lawsuit on behalf of the authors. They did not immediately respond to a request for comment.
The complaint alleges that OpenAI used Tremblay’s “The Cabin at the End of the World” and Awad’s “13 Ways of Looking at a Fat Girl” and “Bunny” as training material to help its artificial intelligence programs, like ChatGPT, produce natural language and conversation.
Large portions of the copyrighted material were used to train ChatGPT and enable it to “emit convincingly naturalistic text outputs in response to user prompts,” according to the complaint.
“Many kinds of material have been used to train large language models. Books, however, have always been a key ingredient in training datasets for large language models because books offer the best examples of high-quality longform writing,” the complaint said.
In a June 2018 paper introducing GPT-1, OpenAI revealed that it trained the model on BookCorpus, “a collection of ‘over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance,’” chosen because, “crucially, it contains long stretches of contiguous text, which allows the generative model to learn to condition on long-range information,” the complaint said.
In July 2020, OpenAI disclosed that 15% of GPT-3’s training dataset came from “two internet-based books corpora” that it called “Books1” and “Books2,” though the company never revealed which books were part of those datasets. The complaint estimated that Books1 contains about 63,000 titles and Books2 about 294,000 titles.
However, in March, OpenAI’s paper introducing GPT-4 “contained no information about its dataset at all,” the complaint said.
OpenAI offers ChatGPT through a web interface, with a paid tier priced at $20 per month. Users can pick the GPT-3.5 model or the newer GPT-4 model; both allow a user to enter questions or commands, or even ask ChatGPT to summarize a copyrighted book, the complaint said.
“On information and belief, the reason ChatGPT can accurately summarize a certain copyrighted book is because that book was copied by OpenAI and ingested by the underlying OpenAI Language Model (either GPT-3.5 or GPT-4) as part of its training data,” the complaint said.
“When ChatGPT was prompted to summarize books written by each of the Plaintiffs, it generated very accurate summaries. … The summaries get some details wrong. These details are highlighted in the summaries. This is expected, since a large language model mixes together expressive material derived from many sources. Still, the rest of the summaries are accurate, which means that ChatGPT retains knowledge of particular works in the training dataset and is able to output similar textual content. At no point did ChatGPT reproduce any of the copyright management information Plaintiffs included with their published works.”