CFJ Comments To US Copyright Office On AI & Copyright (Oct. 2023)
CURT LEVEY
PRESIDENT
THE COMMITTEE FOR JUSTICE
Introduction
The Committee for Justice (CFJ) thanks the U.S. Copyright Office for the opportunity to
comment on this important notice of inquiry regarding artificial intelligence and copyright.
In sum, CFJ recommends that Congress address the issues raised by recent federal lawsuits alleging that the developers of generative AI models have infringed the copyrights on material used to train these models. Analogizing to the application of copyright law to humans who learn to create using examples drawn from copyrighted materials, we recommend to policymakers and the courts that protection for copyright holders be focused on the outputs of generative AI models, asking whether they are substantially similar to copyright-protected training materials.
Founded in 2002, CFJ is a nonprofit legal and policy organization that promotes, and educates the public and policymakers about, the rule of law and the benefits of constitutionally limited government. As part of this mission, CFJ advocates in Congress, the courts, and the news media about a variety of law and technology issues, encompassing intellectual property, antitrust law and competition policy, administrative law and regulatory reform, artificial intelligence, free speech, data privacy, and the impact of all these on innovation and economic growth.
Curt Levey, the author of these comments, is an attorney with a previous career as an artificial
intelligence scientist, as well as legal and policy experience in intellectual property, including
copyright. Recently, he co-authored Supreme Court amicus briefs in two copyright cases, Google v. Oracle (2021) and Andy Warhol Foundation v. Goldsmith (2023).
Prior to attending Harvard Law School (J.D. 1997), Mr. Levey studied artificial intelligence at
Brown University, earning undergraduate and Master’s degrees in computer science. His
Master’s thesis involved research and development of an AI system that reasoned about the
temporal relationships between events. Upon graduation from Brown University, he built
machine learning models as a research assistant to James Anderson, a professor of cognitive
science and brain science at Brown.
After Brown University, Levey joined Hecht-Nielsen Neurocomputer Corp. (later "HNC
Software"), an AI startup company in San Diego, CA, where he worked for five years as a staff
scientist, designing and building numerous AI models and tools, training others to do the same,
and writing about machine learning. While at HNC, in anticipation of the obstacles posed to AI
adoption by transparency and accountability concerns, he invented and implemented a
patented[1] technology that provided explanations and confidence measures for the decisions
made by neural networks. The technology was incorporated in a variety of AI applications and
saw worldwide use in an AI system that detects payment-card fraud.
More recently, Mr. Levey has written about AI policy[2] and organized and moderated[3], as well as participated in[4], panels on the problem of bias in AI. On May 2 of this year, he testified at the U.S. Copyright Office’s listening session on the use of artificial intelligence to generate works of visual art[5].
Questions addressed
Our comments primarily address Question 5 in the request for comments, which asks “Is new
legislation warranted to address copyright or related issues with generative AI? If so, what
should it entail?” Our comments should be taken to suggest not just new legislation, but also
new policy generally, whether it takes the form of statutory changes, regulations, or forthcoming
judicial interpretations of existing copyright law as courts struggle with the novel legal issues
presented by recent litigation concerning generative AI. These comments also touch on the
subject matter of Questions 7.1, 7.2, 8, 9.2, 22, 23, 27, and 34.
1. https://patents.justia.com/patent/5398300
2. See, e.g., https://www.wsj.com/articles/algorithms-with-minds-of-their-own-1510521093; https://thehill.com/opinion/technology/444568-congress-can-bring-the-government-into-the-age-of-artificial-intelligence; https://thehill.com/opinion/technology/432093-american-artificial-intelligence-strategy-offers-promising-start
3. https://fedsoc.org/events/artificial-intelligence-and-bias
4. https://fedsoc.org/events/artificial-intelligence-anti-discrimination-bias; https://regproject.org/event/live-podcast-is-artificial-intelligence-biased-and-what-should-we-do-about-it
5. https://www.copyright.gov/ai/agenda/2023-Visual-Arts-Agenda.pdf
Issues surrounding the use of copyrighted images to train generative AI models have been at
the forefront of legal and public debate over the past year due to the filing of federal lawsuits
claiming that the developers of generative AI models – including image generators and large
language models – have infringed the copyrights on material contained in the training datasets.
These datasets are used (some say “ingested”) by the models as part of the machine learning
process. The complaint in Getty Images v. Stability AI (filed 2023), which targets the Stable
Diffusion model, is typical. As the instant request for comments states, the complaint “alleges
both infringement based on use of copyrighted images to train a generative AI model and on the
possibility of that model generating images ‘highly similar to and derivative of’ copyrighted
images.”
The training materials at issue in these lawsuits are publicly available on the internet. To the
extent that AI model builders want to use curated datasets or any other data that is not publicly
available, their only lawful option is to contract for access to the data, and thus copyright
infringement is not an issue.
In the short term, courts will have to resolve the issue of copyright infringement by training
materials. In the longer term, Congress should step in and clarify the law – presumably by
amending the Copyright Act of 1976 – as it has done before when new technology has resulted in difficult new issues under copyright law. The approach to generative AI and copyright
protection recommended in these comments should guide any statutory changes enacted by
Congress, but it is also relevant to federal regulators and the courts, as they struggle with novel
legal issues.
Whether we are talking about the courts, which must be guided by existing statutes and
precedent, or Congress, the good news is that there is no need to reinvent the wheel when
approaching the question of copyright infringement by generative AI models. The similarities
between machine learning in generative AI models and the way in which human creators learn
from a lifetime of examples make it possible to apply the law governing the latter to the former.
In order to understand the similarities, consider that human creators learn a skill – painting
images or writing music, for example – from countless examples of art, music, and the like.
They are not born with that skill. As a human learns from each new example, the synaptic strengths between neurons in their brain are slightly modified to reflect the new learning.
Similarly, neural networks, the technology underlying generative AI models, learn to generate
images, music, and the like by being presented with a very large number of examples. Neural
networks consist of analogues to biological neurons and synapses, typically simulated in
software. As with humans, learning takes place as the synaptic strengths (“weights”) are slowly
modified in response to training examples.
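To make this analogy concrete, consider the following toy sketch in Python (purely illustrative; the single-layer network, learning rate, and example values are our own assumptions, not a description of any actual generative AI system). It shows how a single training example slightly nudges a model's weights and is then discarded:

    import numpy as np

    # A toy one-layer "neural network": its entire learned state is this
    # small weight matrix. No training example is stored inside it.
    rng = np.random.default_rng(0)
    weights = rng.normal(scale=0.1, size=(4, 1))

    def train_step(example, target, lr=0.01):
        """Nudge the weights slightly toward reproducing the target."""
        global weights
        prediction = example @ weights   # forward pass
        error = prediction - target
        gradient = example.T @ error     # each weight's share of the error
        weights -= lr * gradient         # small, incremental adjustment

    # One training example: after this call, only the slightly adjusted
    # weights remain; the example itself plays no further role.
    x = np.array([[1.0, 0.5, -0.3, 0.8]])
    train_step(x, target=np.array([[1.0]]))

Each example leaves only a faint numerical trace spread across the weights, the machine analogue of the slight synaptic modifications described above.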
In actuality, a trained AI model retains no copies of the training materials. Retaining specific
examples would be at odds with the objective of machine learning, which is to generalize from
rather than remember the training materials. In fact, it is this aspect of neural networks, learning
subtle relationships and patterns that are encoded in the interplay of the model’s weights, that
makes the behavior of AI models so hard to control and explain.
Again, this is similar to how humans learn. While humans are capable of remembering a limited
number of the specific examples they learn from, any sort of deep learning – whether it involves
learning a creative skill or just recognizing the difference between a dog and a wolf – requires
generalization from examples.
When a human learns from ingesting examples in the domain they are trying to master, much of
that material is often protected by copyright. Yet that ingestion goes virtually unchallenged,
being regarded as either fair use or not subject to the Copyright Act at all. However, if that
human uses their learning to reproduce or produce a derivative work from one of the
copyrighted examples without authorization – or indeed from any copyrighted material – they
are liable for copyright infringement unless they have a fair use defense.[6]
To put it another way, when assessing copyright infringement by humans, we typically look at
the potentially infringing work the human produced rather than the process used to produce it,
whether the process is the examples they learned the relevant skills from or the tools they used to produce the work.
6. The question of which person or entity should be liable – the developers of the model, the developers of the training dataset, the user whose prompt resulted in the infringing output, or someone else – is beyond the scope of these comments.
This is a good point at which to stop and ask why we should analogize to human learning and creation in determining how to apply copyright law to potential infringement by generative AI.
One reason is that the analogy allows policy makers to more easily adapt existing law when
faced with this new technology. Dealing with the many legal and policy challenges reflected in
the numerous questions listed in the instant request for comments is hard enough without
reinventing the wheel. Congress and the Copyright Office should use the similarity between
human and machine learning to their advantage to guide policy development.
Similar reasoning should govern the courts as they tackle the legal issues presented by recent
litigation concerning generative AI. In fact, courts are required to apply existing law rather than create new policies. When faced with relatively novel legal issues resulting from new
technology, courts rely on analogies. For example, while the Fourth Amendment’s guarantee of
privacy in one’s “persons, houses, papers” does not apply directly to mobile phones and email
accounts, the courts have analyzed privacy in these newer domains by analogizing to homes,
documents, and the like. So too, the courts should analogize to human learning when
addressing generative AI.
There is also much to be said for intellectual consistency. While it may trouble many people
philosophically to recognize the similarities between human and machine learning, the burden
should be on those who want to treat human and machine learning as incomparable
phenomena for copyright purposes, despite the similarities. Considering that the distinction
between human and machine cognition will surely narrow as AI technology progresses, policies
that downplay the similarities are unlikely to stand the test of time.
Finally, consider that the analogy between human and machine learning can work in both
directions. Policies that treat the ingestion of copyrighted material for learning purposes as
infringement when performed by a generative AI model may someday cast doubt on the lawfulness of the same use of copyrighted material for human learning, especially as technology narrows the
distinction between the two.
Even putting aside the human analogy, the approach taken by recent lawsuits – that is, defining
the use of copyrighted materials to train generative AI models, without more, as copyright
infringement – is unlikely to be a robust solution to protecting the rights of creators. But before discussing this in more detail, we discuss how the alternative approach – that is, protecting copyright holders by focusing on the outputs of generative AI models – can be implemented through statutory changes.

First, Congress should specify that the transformative nature of a derivative work, which weighs in
favor of finding that the work is fair use, is a factor not applicable to works produced by
generative AI, unless the potential infringer can show that the transformative quality is the
product of a human’s prompts. This modification to existing law makes sense when we consider
that a transformative work is one with “a further purpose or different character” than the original
work, as the Supreme Court explained in Andy Warhol Foundation.
Determining the purpose of a derivative work presumes intent on the part of its creator. So too
for a different character that is intentional. As impressive as generative AI models are, it is not
claimed that they have anything akin to human intent with regard to the nature of the outputs
they produce. Therefore, these models cannot genuinely meet the Supreme Court’s definition of
a transformative work.
A second statutory change Congress should make is to specify that the threshold of “substantial
similarity” necessary to find that a work is derivative should be lower when the work is produced
by generative AI, if there is evidence that the original work was in the training dataset. This
makes sense because of the possibility that the presence of the original work among the
training materials indirectly contributed – by influencing the relationships learned by the neural
network – to the substantial similarity, rather than the similarity being a virtual coincidence.
Recall that machine learning, like human learning, does not depend on the retention of materials
in the training dataset. Once those materials are used to train the weights of the neural network,
they can be discarded without any degradation in the performance of a generative AI model.
While more permanent storage of training materials is common – whether by AI model builders
or dataset curators – and can be an independent basis for copyright infringement, this presents
a separate issue from whether the use of copyrighted material merely for training is, in and of
itself, copyright infringement. For training purposes, temporary copying is all that is
fundamentally necessary.
In recognition of this fact, as EFF notes[7], U.S. courts have generally held that temporary copying is either not subject to the Copyright Act or is fair use.
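As a toy illustration of why permanent retention is unnecessary (the file name and stand-in weight matrix below are hypothetical, not drawn from any real system), note that once training is complete, only the weights need to persist:

    import numpy as np

    # Stand-in for the weights of an already-trained model.
    trained_weights = np.array([[0.9], [0.2], [-0.4], [0.1]])

    # Persist only the learned weights; at this point every file in the
    # training dataset could be deleted without degrading the model.
    np.save("model_weights.npy", trained_weights)

    # Later, generating output requires nothing but the saved weights.
    weights = np.load("model_weights.npy")
    new_input = np.array([[0.2, -0.1, 0.7, 0.4]])
    output = new_input @ weights  # produced with no training data present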
It is worth noting that developing a generative AI model doesn’t truly require even a temporarily
copied training dataset. If copying made the developers liable for infringement, they could
instead construct a training process that consisted of scrolling through the publicly available
training materials stored on the internet, rather than gathering those materials into a dataset.
While it would make for a slower training process, the point is that little more than trivial copying
is fundamentally necessary for training, such that those seeking copyright protection would be
advised to hang their hats elsewhere.
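A minimal sketch of such a process might look like the following (the URL list and train_step function are hypothetical placeholders; no real training pipeline is being described). Each publicly available item is fetched, used for one incremental weight update, and never written to disk:

    import urllib.request

    def stream_training(urls, train_step):
        """Train without ever assembling a persistent dataset."""
        for url in urls:
            # Transient, in-memory copy of one publicly available item.
            with urllib.request.urlopen(url) as response:
                example = response.read()
            train_step(example)  # weights are nudged, as in the earlier sketch
            del example          # nothing is archived; only the weights persist

Only the model's weights accumulate the results of training; the materials themselves pass through memory only transiently.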
Another reason why a focus on the copying of training materials is a misguided strategy is that it is not aimed at the real threat to copyright holders. Consider that prior to the emergence of generative AI, simpler neural network models – typically classification and scoring models
– were trained on large amounts of data (numerical, visual, acoustic, textual, and the like), some
portion of which was copyrighted. Yet there was little or no objection from the copyright holders.
The recent explosion in protests and lawsuits by copyright holders is motivated largely by the
fear that the outputs of generative AI will negatively affect the potential market for their
copyrighted works.
While that is a legitimate concern and is, in fact, the fourth factor in fair use analysis, it is a
concern focused entirely on the outputs of generative AI. It would be a concern to human
creators whether or not their works were used to train generative AI. To the extent that copyright
holders focus on the copying of copyrighted materials for training, they are flailing at a
peripheral issue and distracting policymakers and the courts from the real competitive threat
posed by the outputs of generative AI.
Though the use of copyrighted materials to train AI models should be treated as fair use, there
are other means for allowing the owners of potential training materials to seek compensation or
proper attribution or to opt out of having their works used for training. Congress should work to
strengthen those means.
7. https://www.eff.org/files/filenode/temporary_copies_fnl.pdf
8. See https://www.brookings.edu/articles/the-politics-of-ai-chatgpt-and-political-bias/