Publishing giants and generative AI companies are entering into agreements that aim to both protect copyrights and meet the rapidly growing needs of the AI industry.
US publishing giant HarperCollins has struck a deal with an unnamed technology company that allows the company to use some of the publisher's books to train generative AI models.
In a letter seen by AFP, the tech company offers to pay $2,500 for each book selected to train its so-called large language model (LLM) for up to three years.
Artificial intelligence models need vast amounts of text to learn everyday language use.
"HarperCollins has reached an agreement with an artificial intelligence technology company to allow limited use of select backlist nonfiction titles to train artificial intelligence models to improve model quality and performance," the publisher said in a statement.
It said the agreement has a "limited scope and clear guardrails for model output that respect author rights".
Authors "have the option to opt in or opt out of the agreement," it added.
The proposal was received with mixed feelings in the publishing world, with writers such as Daniel Kibblesmith rejecting it out of hand.
"I'd probably do it for a billion dollars. I'd do it for a sum of money that wouldn't require me to work anymore, since that's the ultimate goal of this technology," the author wrote on the social network Bluesky.
HarperCollins is one of the largest publishers to have reached such an agreement, but not the first.
US scholarly publisher Wiley said it had allowed "access to previously published academic and professional book content for specific use in teaching LLM models" in a $23 million contract with an unidentified "major technology company".
The agreements highlight the tensions surrounding artificial intelligence models that aggregate vast amounts of content from the web, creating the risk of widespread copyright infringement.
Giada Pistilli, head of ethics at Hugging Face, a French-American open access AI platform, said these agreements are a step forward as they include payments to publishers. But she regretted that they leave little room for authors to negotiate.
"What we're going to see is a mechanism of bilateral agreements between new technology companies and publishers or copyright holders, whereas I think we need a broader conversation that includes a bit more stakeholders," she said.
Julien Chouraqui, legal director of the French Publishers' Union (SNE), said the agreements represented "progress".
"An agreement means that there has been dialogue and a willingness to strike a balance between the use of copyrighted source data and that which will generate value," he said.
The press is also organizing to meet the challenges created by artificial intelligence.
In late 2023, the New York Times sued OpenAI, the creator of ChatGPT, and Microsoft, its main investor, for violating copyright protections. Other media groups have made deals with OpenAI.
Tech companies may have no choice but to pay to improve their products, especially as they begin to run out of new material to power their models.
"There are a lot of legal and illegal claims on the Internet, and a lot of pirated copies. This not only raises legal issues but also raises questions about data quality," Churaki told SNE.
"If we seek to develop the market on a virtuous basis, we need to involve all participants," he said. | BGNES