OpenAI says it's 'impossible' to train AI without copyrighted materials

OpenAI faces multiple lawsuits over its use of copyrighted articles, books, and art to train its generative artificial intelligence (AI) tools.


OpenAI, the company behind the artificial intelligence (AI) chatbot ChatGPT, has said it would be "impossible" to train its AI tools without using copyrighted materials.

It comes as OpenAI faces multiple lawsuits related to its use of copyrighted articles, books, and art to train ChatGPT. Other AI companies face similar lawsuits.

Generative AI tools are trained on large amounts of content from the internet, which they analyse to learn patterns and then generate new, human-like content.
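As a rough illustration of what "learning patterns from text" means, the short Python sketch below counts which words tend to follow which in a tiny made-up corpus and then samples new text from those counts. It is only a toy stand-in for the idea; the models at issue in these lawsuits use vastly larger neural networks trained on far more data.

    import random
    from collections import defaultdict, Counter

    # A tiny stand-in corpus; real models train on far larger collections of text.
    corpus = (
        "the cat sat on the mat . "
        "the dog sat on the rug . "
        "the cat chased the dog ."
    ).split()

    # Learn a simple pattern: which words tend to follow each word.
    follows = defaultdict(Counter)
    for current, nxt in zip(corpus, corpus[1:]):
        follows[current][nxt] += 1

    # Generate new text by repeatedly sampling a likely next word.
    word = "the"
    output = [word]
    for _ in range(8):
        candidates = follows[word]
        if not candidates:
            break
        words, counts = zip(*candidates.items())
        word = random.choices(words, weights=counts)[0]
        output.append(word)

    print(" ".join(output))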

"Because copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials," OpenAI argued in written evidence submitted to the UK House of Lords last month.

The company's response as part of an inquiry into large language models (LLMs) was first reported by British newspaper The Telegraph.

OpenAI claimed that "limiting" the training data to content in the public domain "would not provide AI systems that meet the needs of today’s citizens".

It added that while the company believes "copyright law does not forbid training", it recognises "there is still work to be done to support and empower creators".

ChatGPT, which was released in November 2022, has accelerated the development of AI tools as its popularity has surged over the past year.

But it has also fuelled concerns that AI tools capable of producing written content and artworks will lead to job losses across multiple industries.

OpenAI responds to New York Times lawsuit

The New York Times was the latest company to file a lawsuit against OpenAI over copyright infringement, arguing that the AI company owes it "billions of dollars in statutory and actual damages".

The extensive 69-page lawsuit claims that OpenAI unlawfully used the New York Times' work to create AI systems that would compete with media companies.

OpenAI's tools generate "output that recites Times content verbatim, closely summarises it, and mimics its expressive style, as demonstrated by scores of examples," the lawsuit argues.

One example in the lawsuit shows text generated by GPT-4 that closely resembled a Pulitzer Prize-winning 2019 New York Times investigation into the taxi industry.

The lawsuit emphasises that these tools have also been extremely lucrative for OpenAI and Microsoft, which is its largest investor.

OpenAI responded this week in a separate blog post addressing the US newspaper's lawsuit, arguing that training AI models with material available on the internet is "fair use" and that the New York Times case is "without merit".

It said it has worked to build partnerships with news organisations to "create mutually beneficial opportunities", and that news media makes up only a "tiny slice" of the content used to train its AI systems.

The AI company has struck licensing deals with media companies such as the Associated Press and Axel Springer, which owns Politico, Business Insider, Bild and Welt, to use their content for training.

OpenAI also argued in its blog post that it offers publishers a simple opt-out to prevent its tools from accessing their websites.
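The opt-out works through the long-standing robots.txt convention: a publisher can block OpenAI's web crawler, GPTBot, by adding a short rule to the file at the root of its site. A minimal sketch is shown below; blocking the entire site with "/" is the simplest case, and a publisher could instead list only specific paths.

    User-agent: GPTBot
    Disallow: /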

It added that memorisation and regurgitation of training content is a "failure" of the system, which is meant to apply concepts to "new problems".