The IP in AI – What you need to know
With AI making waves throughout the business world, we explain the salient role of intellectual property in regulating the technology and protecting the rights of inventors.
In this series, we have explored whether IP rights protect AI systems themselves, and whether copyright or patents protect AI-generated works or inventions. Equally controversial, however, is the way in which AI systems use others’ works. At their core, AI systems are computer systems operating on large volumes of data. Those systems – and most obviously those data – are often the products of others’ intellectual and economic investment. This article explores the degree of protection likely afforded to IP rights holders against unsanctioned use of this material by an AI system.
In general, copyright prevents the unauthorised use of certain categories of subject matter (for example, literary or artistic works). Although the requirements for infringement differ between jurisdictions, in general, if copyright subsists in a work (which we discussed in more detail in part 3 of our series), proving infringement of that copyright requires the copyright owner to show that:
- the alleged infringer has done a restricted act (most commonly, reproduction) in relation to the whole or a substantial part of the work; and
- there is a causal connection between the original work and the alleged copy (that is, the work was actually copied rather than independently created).
There are also a number of exceptions or defences to infringement that can apply, which differ from jurisdiction to jurisdiction – for example, the text and data mining exceptions available in the EU (and, for non-commercial research, in the UK) and the various fair dealing defences in the UK and Australia.
It is noteworthy that none of these jurisdictions has an equivalent to the relatively broad and flexible “fair use” doctrine that applies in the US.
Infringement by training
Central to almost any AI system is a large mass of data on which the system is trained. Although referred to as “data”, the training materials are frequently themselves original works in which copyright subsists. For example, they may be artworks (as in Stable Diffusion) or passages of code (as in Copilot). The process of training an AI or ML system on those inputs almost certainly involves the creation of a copy (in a copyright sense) – and most likely many copies – of those copyright works, even if those copies are only ever used “internally” within the system (eg in training the system) and never reproduced as outputs from it.
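By way of illustration, the sketch below shows a minimal, entirely hypothetical ingestion step of the kind used to assemble an image training set; the URLs, paths and sizes are invented, but each stage makes a further reproduction of the underlying work:

```python
# Illustrative sketch only: a simplified ingestion step of the kind used to
# build an image training set. All URLs, paths and sizes are hypothetical.
import io
import requests        # third-party HTTP client (pip install requests)
from PIL import Image  # Pillow imaging library (pip install Pillow)

IMAGE_URLS = [
    "https://example.com/artwork-001.jpg",  # hypothetical source works
    "https://example.com/artwork-002.jpg",
]

for i, url in enumerate(IMAGE_URLS):
    raw = requests.get(url, timeout=10).content       # copy 1: bytes fetched over the network
    with open(f"dataset/{i:06d}.jpg", "wb") as f:     # copy 2: written into the training corpus
        f.write(raw)
    img = Image.open(io.BytesIO(raw)).convert("RGB")  # copy 3: decoded pixel data in memory
    img = img.resize((256, 256))                      # copy 4: a resized reproduction
    # The resized pixels are then batched into tensors (further copies)
    # and consumed by the training loop.
```

Each of those intermediate reproductions exists only inside the training pipeline, yet each is capable of constituting a copy in the copyright sense.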
This kind of copying is among the key allegations in proceedings brought by Getty Images against Stability AI (Getty Images). Getty Images, a global media provider distributing royalty-free images, photos, music and video, has sued Stability AI in the UK and US for allegedly using over 12 million of its copyrighted images, and their associated captions and metadata, to train Stability AI's text-to-image tool, Stable Diffusion, without consent or compensation. In the US, the case is in its discovery stages,12 whilst in the UK the High Court, on 1 December 2023, declined to dismiss the claim summarily and set the case down for trial, finding that it had real prospects of success.13 In addition to copyright infringement, the UK litigation involves allegations of database right infringement, trade mark infringement and passing off (see our recent update on this case and other generative AI litigation worldwide here).
Authors including Jodi Picoult and George RR Martin have also sued OpenAI in the US (Authors Guild, et al. v. OpenAI, Inc.), alleging that the wholesale copying of fiction authors’ works, without permission or compensation, to train OpenAI's large language models (LLMs) infringes those authors’ rights. They also argue that the outputs of these LLMs are derivative works which mimic or paraphrase the authors’ work and harm the market for it. The Authors Guild alleges that this threatens the livelihood of authors, and has most recently joined Microsoft as a defendant. Unsurprisingly, many other groups of authors have brought separate suits against OpenAI, the developer of ChatGPT, based on similar concerns (including Tremblay v. OpenAI, Inc.).
Practical challenges
Although this act of infringement may be conceptually straightforward in a legal sense, practical matters make it difficult to establish – most obviously, proving that a specific copyright work was in fact included in a training dataset at all, given the limited transparency around training data and the technical opacity of AI systems.
Government responses
These issues, and the challenges they present for rights holders, are a high priority for governments worldwide. For example, the current draft of the EU AI Act, which is being negotiated between the EU Council, EU Parliament and EU Commission, contains provisions mandating transparency of training data, so that copyright-protected materials used in training an AI can be identified (see our blog post here). In addition, the EU AI Act requires providers of general-purpose AI models to make publicly available a sufficiently detailed summary of the content (including text and data protected by copyright) used to train the model.
In Australia, in December 2023, Commonwealth Attorney-General Mark Dreyfus announced the establishment of a copyright and AI reference group “to better prepare for future copyright challenges emerging from AI”, expressly referring to the need to address copyright issues concerning “the material used to train AI models” and “transparency of inputs and outputs”.
The UK House of Lords Communications and Digital Committee issued a report on LLMs and Generative AI in February 2024 (see our blog post here), which called on the UK Government to support copyright holders, saying the Government “cannot sit on its hands” while LLM developers exploit the works of rightsholders. The report expressly called for a way for rightsholders to check training data for copyright breaches, and the Committee Chair was quoted as saying:
One area of AI disruption that can and should be tackled promptly is the use of copyrighted material to train LLMs. LLMs rely on ingesting massive datasets to work properly but that does not mean they should be able to use any material they can find without permission or paying rightsholders for the privilege. This is an issue the Government can get a grip of quickly and it should do so.
In its response following the consultation on its AI Regulation White Paper, published in February 2024, the UK Government did not produce the definitive solution the House of Lords had called for, noting instead the UK IPO's unsuccessful attempts over the preceding 18 months to broker a solution between stakeholders, and pointing to further examination of ways to improve transparency in the use of copyright material (see our blog post here). As a result, in the UK it may well fall to the courts to determine the copyright position in the short term, although this may not be to the liking of those investing in AI development.
Infringement by outputs
Aside from infringement during the training of an AI system, an AI system may also produce outputs that themselves infringe copyright, in the sense that they bear sufficient objective similarity to an original work. Since this requires only a side-by-side comparison of a given output from the AI system and a given original work (rather than a forensic enquiry into whether the original work was in fact among the training data), this kind of claim avoids some of the difficulties referred to above. Here, however, the difficulty lies primarily in showing the requisite degree of objective similarity between a given output and a given input.
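That comparison is ultimately a qualitative legal assessment, but technical proxies are sometimes used to triage candidate pairs before any human side-by-side review. The sketch below uses perceptual hashing via the open-source imagehash Python library; the file names and the threshold are purely illustrative assumptions:

```python
# Illustrative sketch only: perceptual hashing is a crude technical proxy for
# visual similarity, not the legal test for copyright infringement.
from PIL import Image  # Pillow (pip install Pillow)
import imagehash       # pip install ImageHash

original = imagehash.phash(Image.open("original_work.jpg"))  # hypothetical file
output = imagehash.phash(Image.open("ai_output.jpg"))        # hypothetical file

distance = original - output  # Hamming distance between the 64-bit hashes
print(f"Hamming distance: {distance} (lower = more visually similar)")
if distance <= 8:  # the threshold is an arbitrary illustration
    print("Candidate pair worth a closer human comparison.")
```

A small Hamming distance flags near-identical images, but it says nothing about whether a substantial part has been taken, or about stylistic imitations and composite outputs – which is precisely where the litigation difficulty lies.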
Establishing that similarity was the primary challenge in GitHub and in Andersen v. Stability AI Ltd (Andersen),14 where many of the claims originally brought were dismissed because the plaintiffs could not identify a specific original work bearing sufficient objective similarity to a specific output work. That difficulty is the product of a multitude of practical factors, including poor or inaccurate referencing and a lack of transparency from developers, as well as the technical nature of AI systems.
This problem can be exacerbated by the “Snoopy problem” (also referred to as the “Italian plumber problem”). If the training data includes enough example images of a particular well-known subject (such as Snoopy), or of a particular style of work, it may be difficult to draw a sufficient causal link between a given output and a specific input image. This, too, is an issue in Getty Images, where one argument being debated in relation to potential defences (the Defence has yet to be filed) is that the outputs are “inspired by”, rather than direct copies of, the originals, since they mix elements from multiple sources. In that respect, the replication of watermarks, or parts of them (discussed above), may assist Getty.
Academics and software developers have recently sought to develop methods of identifying whether text was generated by an AI, but these methods currently appear limited to text-based output and have limited reliability and accuracy. Any move by governments to mandate, as a matter of policy, a framework for watermarking or indicating the source of an AI output will also need to weigh countervailing issues, including economic policy, competition and the promotion of innovation.
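For illustration, one family of published academic proposals embeds a statistical “green-list” watermark at generation time and detects it afterwards with a simple hypothesis test. The toy sketch below is heavily simplified – real schemes operate on model tokenizers and seeded pseudo-random vocabulary partitions – and everything in it is an assumption made for exposition:

```python
# Toy "green-list" watermark detector. At generation time, a watermarked model
# would be nudged to prefer words on a pseudo-random "green list" determined by
# the preceding word; detection then checks whether green words are
# over-represented. This is a simplification of published academic schemes.
import hashlib
import math

def is_green(prev_word: str, word: str) -> bool:
    # Deterministically assign roughly half of all words to the "green list"
    # for a given preceding word.
    digest = hashlib.sha256(f"{prev_word}|{word}".encode()).digest()
    return digest[0] % 2 == 0

def detect(text: str, gamma: float = 0.5) -> float:
    """Return a z-score; large positive values suggest a watermarked source."""
    words = text.lower().split()
    n = len(words) - 1  # number of (previous word, word) pairs
    if n <= 0:
        return 0.0
    hits = sum(is_green(a, b) for a, b in zip(words, words[1:]))
    # Under the null hypothesis (unwatermarked text), hits ~ Binomial(n, gamma).
    return (hits - gamma * n) / math.sqrt(gamma * (1 - gamma) * n)

print(detect("the quick brown fox jumps over the lazy dog"))
```

Even in this idealised form, the test needs a reasonably long text to reach statistical significance, and paraphrasing degrades it – consistent with the reliability concerns noted above.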
Another challenge in these kinds of cases is identifying the infringer. If an AI system can be used to generate an output work bearing similarity to a given input, but only when that AI system is used by a user who is determined to infringe, who is (or should be) liable for that infringement?15 In many jurisdictions the answer may be both the user and the AI system owner – the former for the primary infringement and the latter for “authorisation”, “vicarious” or “secondary” infringement. However, assessment of such “secondary” liability often requires an examination of the degree to which the AI system owner can control or prevent the allegedly infringing conduct of the user.
Aside from copyright infringement, the owners of works used to train an AI system may have other causes of action in relation to a given output. For example, even if a given input work is available on open-source licence terms, those terms may require the retention of attribution information, or require that any derivative works are licensed on terms no less open than those applying to the inputs (so-called “copyleft” licences). Indeed, the removal of attribution (or copyright management information) is part of the complaint brought by the plaintiffs in GitHub.
Patent infringement
With the rapid development of AI systems, companies like Google, Samsung and Microsoft led the market in AI-related patent applications at the EPO in the period 2016 to 2020.16
While copyright infringement has dominated the IP litigation brought to date in the context of works generated by AI, patent disputes involving AI systems are emerging. Following Dr Thaler's series of applications, it is now relatively settled in most jurisdictions worldwide that an AI system cannot itself be an “inventor” for the purposes of patent law (see our blog post here on the UK Supreme Court decision of December 2023 in that regard, and our previous article here); the focus has accordingly shifted to infringement of patents that seek to protect the AI system itself.
In July 2023, FriendliAI commenced proceedings in the United States District Court for the District of Delaware against Hugging Face (FriendliAI Inc. v. Hugging Face, Inc.), which offers an inference server for LLMs called Text Generation Inference (TGI). FriendliAI's founder and CEO, Dr Byung-gon Chun, is the inventor of PeriFlow/Orca, which uses iteration-level or dynamic “batching” to make the serving of generative AI transformer models more efficient and scalable, allowing the AI to process multiple requests at once. Hugging Face states on its website that it uses PeriFlow/Orca, which FriendliAI contends constitutes infringement of its patent entitled ‘Dynamic Batching for Inference System for Transformer-Based Generation Tasks’. The matter is in its early stages, and it will be one of the first patent infringement cases relating to an AI technology.
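To make the batching concept concrete, the toy simulation below contrasts static batching, where a whole batch occupies the model until its longest request finishes, with iteration-level batching, where finished requests leave and queued requests join at every decoding step. It is a schematic sketch using invented numbers and a simplified scheduling policy, not FriendliAI's or Hugging Face's actual code:

```python
# Toy comparison of static vs iteration-level ("continuous") batching for
# serving generation requests. Each job is the number of tokens it still
# needs; each step, every running job generates one token.
from collections import deque

def static_batching(jobs, batch_size):
    """A whole batch holds the model until its longest request finishes."""
    steps, queue = 0, deque(jobs)
    while queue:
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        steps += max(batch)  # shorter jobs in the batch sit idle until then
    return steps

def iteration_level_batching(jobs, batch_size):
    """Finished requests leave, and queued ones join, at every step."""
    steps, queue, running = 0, deque(jobs), []
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())          # admit requests mid-flight
        running = [r - 1 for r in running if r > 1]  # one token each; done jobs leave
        steps += 1
    return steps

jobs = [3, 50, 4, 2, 45, 6]  # remaining tokens per request (illustrative)
print(static_batching(jobs, batch_size=2))           # 99 model steps
print(iteration_level_batching(jobs, batch_size=2))  # 56 model steps
```

Under these assumptions, the iteration-level scheduler completes the same workload in far fewer model steps, which is the kind of efficiency gain the dispute concerns.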
These patent cases, while dealing with AI subject matter, will grapple with relatively traditional patent law concepts, including construction of the patent claims, consideration of whether those claims have been exploited, and counterclaims attacking the patent's validity (see our previous article on patent protection of AI systems). An example of the latter occurred in December 2023, just before the Supreme Court's decision in the DABUS/Thaler case on inventorship: the High Court of England and Wales rejected a challenge to the patentability of an AI system relating to an artificial neural network, which was held not to be excluded from patentability (see our blog post here). The UK IPO has been granted leave to appeal the decision to the Court of Appeal. In the interim, however, the UK IPO has temporarily suspended its guidance on the examination of AI inventions while it considers the impact of the decision, and has issued a practice update specifically relating to the examination of ANNs.
A common thread among the cases discussed above is the set of normative considerations associated with IP protection and enforcement in relation to materials used and produced by AI systems. These include the adequate compensation of copyright owners, their lost opportunities to license their works, and market usurpation through derivative works.
Copyright holders asserting their rights, including Getty Images, have often reiterated that they do not seek to chill the development of AI technology, but are instead focused on the ethical sourcing of data, including compensation for copyright holders, consent (including by exploring opt-out models) and an opportunity to license. These issues underlie the debates worldwide over the regulation of AI and attempts to balance opportunity with equity.
At the same time, organisations hosting large volumes of data are realising the potential value of those data to new and upcoming AI systems and are putting measures in place to protect them. Reddit, for example, has announced that it plans to charge companies for accessing its application programming interface (which is used by external entities to download conversations from the forum), even though its User Agreement confirms that users retain ownership of content they post to the platform.
Outside of the strict bounds of the law, developers of AI systems may also begin to see the fair and ethical sourcing of their input data as forming a part of their ESG public image and "social licence to operate". In late 2023, for example, Canva announced a commitment not to train its proprietary AI models on its creators’ content without express permission, and established a $200 million compensation program for creators who consent to having their content used to train those models.
The growing frequency of attempts to regulate these issues – and of the disputes arising from them – demonstrates the challenge IP law currently faces in striking a balance between encouraging investment in AI technologies and protecting the investments already made in the material used to train them. The legal reforms and market mechanisms that might achieve this balance are yet to be implemented, but the results of the various disputes around the world may help to illustrate the difficulties of the current position and provide added impetus towards an international solution to an international problem.
For more on the developing area of intellectual property protection and risks for AI and ML systems, follow the IP in AI series on our IP blog.
The contents of this publication are for reference purposes only and may not be current as at the date of accessing this publication. They do not constitute legal advice and should not be relied upon as such. Specific legal advice about your specific circumstances should always be sought separately before taking any action based on this publication.
© Herbert Smith Freehills 2024