Thomson Reuters CPO David Wong tells SiliconRepublic.com why AI systems should respect copyright for their own future benefit.
Big Tech companies and content creators are increasingly at odds as the generative artificial intelligence (GenAI) race intensifies, just a few years after the technology hit the mainstream. And at the core of the issue is AI’s growing role in redefining copyright and fair use.
General purpose large language models (LLMs) such as OpenAI’s ChatGPT, Anthropic’s Claude and Google’s Gemini are built on extremely large corpora of training data that include anything and everything, from published books and articles to website data and social media posts. However, copyright holders are generally neither asked for permission nor compensated when companies take their work to train AI models.
Unsurprisingly, content creators have a problem with this. In recent years, publishers, artists and other stakeholders have launched legal battles against ‘Big AI’ over these very issues. In 2023, The New York Times filed a copyright lawsuit against OpenAI, while Nvidia was sued last year by a trio of authors and the Canadian AI start-up Cohere was sued by more than a dozen top news publishers in February this year. And those are just a few examples.
However, it’s hard to prove copyright infringement, as two news media outlets found out last year. Raw Story Media and AlterNet Media filed a legal complaint against OpenAI in early 2024, claiming that in an “extensive review” of publicly available information, they found “thousands” of their copyrighted works included in OpenAI’s data sets.
A US court, though, dismissed the lawsuit because the plaintiffs were unable to show any “concrete injury”. The judge ruled that the likelihood of ChatGPT, an AI model trained on large swaths of data, outputting plagiarised content from one of the plaintiffs’ articles was “remote”.
AI companies often rely on the fair use argument, claiming that they do not reproduce content but rather analyse and transform it. Copyright holders, on the other hand, call it theft. The constant back and forth raises the question: is this sustainable?
The Canadian multinational Thomson Reuters is best known as the parent of the Reuters news agency. However, the content-driven company also provides AI SaaS offerings for its large B2B clientele. Its chief product officer, David Wong, explains why AI models need to respect copyright.
Copyright creates accessibility
In recent months, the US has softened its stance on regulating AI and Big Tech, moving closer to outright deregulation.
In his first week in office, president Donald Trump repealed the country’s existing AI policy which was aimed at setting up guardrails around the developing technology.
In its place, the Trump administration wants a new ‘AI Action Plan’, which will likely be shaped by the very companies the policy might police.
And last month, two of the biggest AI companies, OpenAI and Google, sent in their proposals for the policy. Unsurprisingly, the companies advocated for looser laws.
Google said that copyright and privacy laws can “impede appropriate access to data”, which it deems necessary for training leading AI models.
“Balanced” copyright rules such as fair use and text and data mining exceptions “have been critical to enabling AI systems to learn from prior knowledge”, it argued.
Meanwhile, OpenAI wants a copyright strategy that “promotes the freedom to learn” – one that would “extend the system’s role into the intelligence age”.
The ChatGPT-maker says that its models are trained not to replicate work, but rather to extract patterns, linguistic structures and contextual insights.
“This means our AI model training aligns with the core objectives of copyright and the fair use doctrine,” it claimed. They are claims the company is currently battling to prove in court.
However, Wong argues that copyright law is the very reason so much content is freely accessible in the first place. Among his many roles at Thomson Reuters, Wong leads product design and management, including the company’s many AI models. He says that copyright mechanisms are “important just to be able to maintain the marketplace”.
Describing a “free-for-all” scenario where copyright laws don’t exist, he explains that the market’s natural response would be to “throw up walls”.
“It would be in the interest of anybody that produces content to try to protect their economic interests and to not put it out there freely.” So, he asks, why would anyone agree to their content, paywalled or not, being taken without compensation by other businesses?
“The irony of the [AI policy] proposal is that it would actually make innovation harder because access to content would become more difficult.”
Copyright holders should instead be given an incentive, Wong explains, a “motivation” to produce more.
If producing content becomes less lucrative, why would creators keep putting their work out freely? And how would AI models, which require knowledge to train, be developed?
Creating actually useful AI systems
Wong says that a collaboration between those who produce content and those who consume it can create an economic system where everybody wins.
“Our position is that copyright…ultimately creates a more productive ecosystem where those that produce content…are fairly compensated and their efforts are treated with the value that they have in the products they ultimately support.”
Earlier this year, Google signed a deal with the Associated Press to deliver its content to Google’s Gemini AI.
At the same time, a useful AI system needs to be able to explain its reasoning and provide citations, Wong says.
AI models like ChatGPT are trained on vast and diverse datasets. While this results in models that can handle a wide variety of tasks, the general and unspecified nature of their training data also makes them more prone to hallucinations, rendering them impractical for professional work unless tweaked.
Purpose-built LLMs, by contrast, are designed for specific industries or tasks and are generally trained on domain-specific data. Thomson Reuters creates many such models, including Westlaw, a legal research assistant, as well as a tax and accounting research tool, among others.
To build these models, the company uses retrieval augmented generation, or RAG for short. RAG is a popular technique in which an AI model retrieves relevant material from curated data sources at query time and grounds its answer in that material, complete with citations. This increases the accuracy and reliability of GenAI models while reducing hallucinations.
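To make that pattern concrete, below is a minimal, illustrative RAG sketch in Python. It is not Thomson Reuters’ implementation: the sample corpus, the keyword-overlap retrieval and the `generate` stub are hypothetical placeholders standing in for a real vector database and LLM call.

```python
# Minimal retrieval-augmented generation (RAG) sketch.
# A production system would use a vector database and an LLM API;
# here, naive keyword overlap and a stub stand in for both.

from dataclasses import dataclass


@dataclass
class Document:
    source: str  # citation shown to the user
    text: str


# Hypothetical domain corpus, standing in for a curated document store.
CORPUS = [
    Document("Case Digest 2024/17",
             "Fair use turns on purpose, nature, amount and market effect."),
    Document("Tax Note 2023/04",
             "Input VAT may be reclaimed only on purchases made for taxable supplies."),
]


def retrieve(query: str, corpus: list[Document], k: int = 2) -> list[Document]:
    """Rank documents by naive keyword overlap with the query."""
    q_terms = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda d: len(q_terms & set(d.text.lower().split())),
        reverse=True,
    )
    return scored[:k]


def build_prompt(query: str, docs: list[Document]) -> str:
    """Ground the model in retrieved sources so it can cite them."""
    context = "\n".join(f"[{d.source}] {d.text}" for d in docs)
    return (
        "Answer using only the sources below and cite them.\n"
        f"{context}\n\nQuestion: {query}"
    )


def generate(prompt: str) -> str:
    """Placeholder for a call to a large language model."""
    return f"(model answer grounded in {prompt.count('[')} cited sources)"


if __name__ == "__main__":
    question = "What factors decide fair use?"
    answer = generate(build_prompt(question, retrieve(question, CORPUS)))
    print(answer)
```

Because the answer is assembled from retrieved, attributed sources rather than from whatever the model memorised during training, the output can carry citations back to the original material, which is the kind of explainability Wong describes.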
Interestingly, Thomson Reuters won a first-of-its-kind fair use summary judgement against Ross Intelligence, a competing legal AI search engine builder. In its lawsuit, filed in 2020, the multinational claimed that Ross made use of material from its Westlaw search engine, which indexes legal material that is not itself copyrightable, to build its own competing search tool.
In 2023, the presiding judge held that there were genuine factual disputes as to whether Westlaw’s headnotes were original enough to warrant copyright protection. However, that judgement was revised this February, when the court held that the headnotes were sufficiently original to be protected by copyright.
Thomson Reuters is also developing proprietary AI models, training them on its own data. “We’ve done this because we want to see what’s the extent of how far the technology can go,” Wong says. “We want professional work to be transformed by AI.”