CFJ Comments To US Copyright Office On AI & Copyright (Oct. 2023)

The Committee for Justice
1629 K Street, N.W., Suite 300
Washington, DC 20006
(202) 270-7748 | committeeforjustice.org | @CmteForJustice

U.S. COPYRIGHT OFFICE
LIBRARY OF CONGRESS

Re: Notice of Inquiry and Request for Comments
on Artificial Intelligence and Copyright
DOCKET NUMBER: 2023-6
SUBMITTED: OCTOBER 30, 2023

CURT LEVEY
PRESIDENT
THE COMMITTEE FOR JUSTICE

Introduction

The Committee for Justice (CFJ) thanks the U.S. Copyright Office for the opportunity to
comment on this important notice of inquiry regarding artificial intelligence and copyright.

In sum, CFJ recommends that Congress address the issues raised by recent federal lawsuits
alleging that the developers of generative AI models have infringed the copyrights on material
used to train these models. Analogizing to the application of copyright law to humans who learn
to create using examples drawn from copyrighted materials, we recommend to policymakers
and the courts that protection for copyright holders be focused on the outputs of generative AI
models, asking whether they are substantially similar to copyright-protected training materials.

Founded in 2002, CFJ is a nonprofit legal and policy organization that promotes, and educates
the public and policymakers about, the rule of law and the benefits of constitutionally limited
government. As part of this mission, CFJ advocates in Congress, the courts, and the news
media about a variety of law and technology issues, encompassing intellectual property,
antitrust law and competition policy, administrative law and regulatory reform, artificial
intelligence, free speech, data privacy, and the impact of all these on innovation and economic
growth.

Curt Levey, the author of these comments, is an attorney with a previous career as an artificial
intelligence scientist, as well as legal and policy experience in intellectual property, including
copyright. Recently, he co-authored Supreme Court amicus briefs in two copyright cases,
Google v. Oracle (2021) and Andy Warhol Foundation v. Goldsmith (2023).

Prior to attending Harvard Law School (J.D. 1997), Mr. Levey studied artificial intelligence at
Brown University, earning undergraduate and Master’s degrees in computer science. His
Master’s thesis involved research and development of an AI system that reasoned about the
temporal relationships between events. Upon graduation from Brown University, he built
machine learning models as a research assistant to James Anderson, a professor of cognitive
science and brain science at Brown.

After Brown University, Levey joined Hecht-Nielsen Neurocomputer Corp. (later "HNC
Software"), an AI startup company in San Diego, CA, where he worked for five years as a staff
scientist, designing and building numerous AI models and tools, training others to do the same,
and writing about machine learning. While at HNC, in anticipation of the obstacles posed to AI
adoption by transparency and accountability concerns, he invented and implemented a
patented[1] technology that provided explanations and confidence measures for the decisions
made by neural networks. The technology was incorporated in a variety of AI applications and
saw worldwide use in an AI system that detects payment-card fraud.

More recently, Mr. Levey has written about AI policy[2] and organized and moderated[3], as well
as participated in[4], panels on the problem of bias in AI. On May 2 of this year, he testified at the
U.S. Copyright Office’s listening session on the use of artificial intelligence to generate works
of visual art.[5]

Questions addressed

Our comments primarily address Question 5 in the request for comments, which asks “Is new
legislation warranted to address copyright or related issues with generative AI? If so, what
should it entail?” Our comments should be taken to suggest not just new legislation, but also
new policy generally, whether it takes the form of statutory changes, regulations, or forthcoming
judicial interpretations of existing copyright law as courts struggle with the novel legal issues
presented by recent litigation concerning generative AI. These comments also touch on the
subject matter of Questions 7.1, 7.2, 8, 9.2, 22, 23, 27, and 34.

Our comments focus on a subset of AI technologies, namely generative AI models. These
models produce complex output patterns – typically images, audio, video, or text – after being
trained on datasets consisting of training materials of the same data type.

[1] https://patents.justia.com/patent/5398300
[2] See, e.g., https://www.wsj.com/articles/algorithms-with-minds-of-their-own-1510521093;
https://thehill.com/opinion/technology/444568-congress-can-bring-the-government-into-the-age-of-artificial-intelligence;
https://thehill.com/opinion/technology/432093-american-artificial-intelligence-strategy-offers-promising-start
[3] https://fedsoc.org/events/artificial-intelligence-and-bias
[4] https://fedsoc.org/events/artificial-intelligence-anti-discrimination-bias;
https://regproject.org/event/live-podcast-is-artificial-intelligence-biased-and-what-should-we-do-about-it
[5] https://www.copyright.gov/ai/agenda/2023-Visual-Arts-Agenda.pdf



Comments

Issues surrounding the use of copyrighted images to train generative AI models have been at
the forefront of legal and public debate over the past year due to the filing of federal lawsuits
claiming that the developers of generative AI models – including image generators and large
language models – have infringed the copyrights on material contained in the training datasets.
These datasets are used (some say “ingested”) by the models as part of the machine learning
process. The complaint in Getty Images v. Stability AI (filed 2023), which targets the Stable
Diffusion model, is typical. As the instant request for comments states, the complaint “alleges
both infringement based on use of copyrighted images to train a generative AI model and on the
possibility of that model generating images ‘highly similar to and derivative of’ copyrighted
images.”

The training materials at issue in these lawsuits are publicly available on the internet. To the
extent that AI model builders want to use curated datasets or any other data that is not publicly
available, their only lawful option is to contract for access to that data, in which case copyright
infringement is not at issue.

In the short term, courts will have to resolve the issue of copyright infringement by training
materials. In the longer term, Congress should step in and clarify the law – presumably by
amending the Copyright Act of 1976 – as it has done before when new technologies have raised
difficult new issues under copyright law. The approach to generative AI and copyright
protection recommended in these comments should guide any statutory changes enacted by
Congress, but it is also relevant to federal regulators and the courts, as they struggle with novel
legal issues.

Similarity between human and machine learning

Whether we are talking about the courts, which must be guided by existing statutes and
precedent, or Congress, the good news is that there is no need to reinvent the wheel when
approaching the question of copyright infringement by generative AI models. The similarities
between machine learning in generative AI models and the way in which human creators learn
from a lifetime of examples make it possible to apply the law governing the latter to the former.

In order to understand the similarities, consider that human creators learn a skill – painting
images or writing music, for example – from countless examples of art, music, and the like.
They are not born with that skill. As a human learns from each new example, the synaptic
strengths between neurons in their brain are slightly modified to reflect the new learning.

Similarly, neural networks, the technology underlying generative AI models, learn to generate
images, music, and the like by being presented with a very large number of examples. Neural
networks consist of analogues to biological neurons and synapses, typically simulated in
software. As with humans, learning takes place as the synaptic strengths (“weights”) are slowly
modified in response to training examples.
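
To make the mechanics concrete, the following minimal Python sketch (a hypothetical toy, not the training procedure of any actual generative model) shows how presenting a single training example slightly adjusts a network’s weights, the machine analogue of synaptic strengths:

```python
import numpy as np

# A toy one-layer network: the weights play the role of synaptic strengths.
rng = np.random.default_rng(0)
weights = rng.normal(size=3)   # random initial weights, like an untrained skill
LEARNING_RATE = 0.01           # each example changes the weights only slightly

def train_step(weights, example, target):
    """One gradient-descent update: nudge the weights toward the target output."""
    prediction = example @ weights       # the network's current output
    error = prediction - target
    gradient = error * example           # direction of steepest error increase
    return weights - LEARNING_RATE * gradient   # small step downhill

# Presenting a training example slightly modifies the weights; the example
# itself can then be discarded, because only the updated weights persist.
example = np.array([1.0, 0.5, -0.2])
weights = train_step(weights, example, target=1.0)
```

Real generative models differ in scale (many layers and billions of weights) but not in kind: training is a long sequence of such small weight adjustments.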



There seems to be a common misconception that a trained generative AI model retains copies,
in some form, of the individual training materials. Take, for example, the complaint in Andersen
v. Stability AI (filed 2023), alleging copyright infringement by popular image-generating AI
models. The plaintiffs’ attorneys, who also represent plaintiffs in several of the other
leading legal challenges to generative AI, assert that “By training Stable Diffusion on the
Training Images, Stability caused those images to be stored at and incorporated into Stable
Diffusion as compressed copies.”

In actuality, a trained AI model retains no copies of the training materials. Retaining specific
examples would be at odds with the objective of machine learning, which is to generalize from
rather than remember the training materials. In fact, it is this aspect of neural networks, learning
subtle relationships and patterns that are encoded in the interplay of the model’s weights, that
makes the behavior of AI models so hard to control and explain.
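
A toy sketch (hypothetical, using an arbitrary Hebbian-style update rule) makes the point concrete: however much material is streamed through training, the model’s persistent state remains the same fixed block of weights, and nothing resembling per-example copies accumulates inside it.

```python
import numpy as np

rng = np.random.default_rng(1)

# The model's entire persistent state: a fixed block of weights (about 80 KB here).
weights = rng.normal(size=(100, 100))

ingested_bytes = 0
for _ in range(100_000):                 # stream a large number of training examples
    example = rng.normal(size=100)
    ingested_bytes += example.nbytes     # tally how much material was "ingested"
    weights += 1e-6 * np.outer(example, example)   # toy Hebbian-style weight update
    # `example` now goes out of scope; it is retained nowhere in the model

print(f"ingested ~{ingested_bytes / 1e6:.0f} MB of training material")
print(f"retained {weights.nbytes / 1e3:.0f} KB of weights, regardless of dataset size")
```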

Again, this is similar to how humans learn. While humans are capable of remembering a limited
number of the specific examples they learn from, any sort of deep learning – whether it involves
learning a creative skill or just recognizing the difference between a dog and a wolf – requires
generalization from examples.

Focus on the outputs

When a human learns from ingesting examples in the domain they are trying to master, much of
that material is often protected by copyright. Yet that ingestion goes virtually unchallenged,
being regarded as either fair use or not subject to the Copyright Act at all. However, if that
human uses their learning to reproduce or produce a derivative work from one of the
copyrighted examples without authorization – or indeed from any copyrighted material – they
are liable for copyright infringement unless they have a fair use defense.

To put it another way, when assessing copyright infringement by humans, we typically look at
the potentially infringing work the human produced rather than the process used to produce it,
whether that process involves the examples they learned the relevant skills from or the tools
they used to produce the work.

We suggest a similar approach to assessing copyright infringement by generative AI models –
that is, a focus on the outputs of the model. Where an output is “substantially similar” to any of
the copyright-protected training materials – or any copyrighted work for that matter – it should
be treated as a derivative work, subject to liability for infringement.[6]

[6] The question of which person or entity should be liable – the developers of the model, the
developers of the training dataset, the user whose prompt resulted in the infringing output, or
someone else – is beyond the scope of these comments.
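
Although “substantial similarity” is ultimately a legal judgment rather than a numeric test, an output-focused rule would likely prompt developers to screen generated works before release. The sketch below is purely illustrative: it assumes precomputed feature embeddings of the training works and uses an arbitrary cosine-similarity threshold as a stand-in for whatever measure a court or regulator might credit.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two feature embeddings, ranging over [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical index: embeddings of copyrighted training works, precomputed by
# some feature extractor (an image or text encoder, for instance).
rng = np.random.default_rng(0)
training_index = {
    "work_001": rng.normal(size=128),
    "work_002": rng.normal(size=128),
}

THRESHOLD = 0.95   # placeholder; the legally relevant cutoff is a policy question

def screen_output(output_embedding: np.ndarray):
    """Flag a generated output that closely matches any indexed training work."""
    for work_id, embedding in training_index.items():
        score = cosine_similarity(output_embedding, embedding)
        if score >= THRESHOLD:
            return work_id, score   # candidate derivative work: hold for review
    return None                     # no close match among the indexed works
```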



On the other hand, the mere use of copyrighted material in a training dataset should be
presumed to be fair use rather than copyright infringement. The presumption of fair use can,
nonetheless, be rebutted by showing that there was copying and retention of copyrighted
training materials beyond that necessary to train the AI model (more on this later).

Why rely on human-based policies?

This is a good point at which to stop and ask why we should analogize to human learning and
creation in determining how to apply copyright law to potential infringement by generative AI.
One reason is that the analogy allows policymakers to more easily adapt existing law when
faced with this new technology. Dealing with the many legal and policy challenges reflected in
the numerous questions listed in the instant request for comments is hard enough without
reinventing the wheel. Congress and the Copyright Office should use the similarity between
human and machine learning to their advantage to guide policy development.

Similar reasoning should govern the courts as they tackle the legal issues presented by recent
litigation concerning generative AI. In fact, courts are required to apply existing law rather than
create new policies. When faced with relatively novel legal issues resulting from new
technology, courts rely on analogies. For example, while the Fourth Amendment’s guarantee of
privacy in one’s “persons, houses, papers” does not apply directly to mobile phones and email
accounts, the courts have analyzed privacy in these newer domains by analogizing to homes,
documents, and the like. So too, the courts should analogize to human learning when
addressing generative AI.

There is also much to be said for intellectual consistency. While it may trouble many people
philosophically to recognize the similarities between human and machine learning, the burden
should be on those who want to treat human and machine learning as incomparable
phenomena for copyright purposes, despite the similarities. Considering that the distinction
between human and machine cognition will surely narrow as AI technology progresses, policies
that downplay the similarities are unlikely to stand the test of time.

Finally, consider that the analogy between human and machine learning can work in both
directions. Policies that treat the ingestion of copyrighted material for learning purposes as
infringement when performed by a generative AI model may someday cast doubt on the lawfulness of the
same use of copyrighted material for human learning, especially as technology narrows the
distinction between the two.

Additional protection against derivative works

Even putting aside the human analogy, the approach taken by recent lawsuits – that is, defining
the use of copyrighted materials to train generative AI models, without more, as copyright
infringement – is unlikely to be a robust solution for protecting the rights of creators. But before
discussing this in more detail, we explain how the alternative approach – that is, protecting



copyrights by focusing on a model’s outputs – can be strengthened by policymakers.
Specifically, Congress should make two changes to the Copyright Act.

Congress should specify that the transformative nature of a derivative work, which weighs in
favor of finding that the work is fair use, is a factor not applicable to works produced by
generative AI, unless the potential infringer can show that the transformative quality is the
product of a human’s prompts. This modification to existing law makes sense when we consider
that a transformative work is one with “a further purpose or different character” than the original
work, as the Supreme Court explained in Andy Warhol Foundation.

Determining the purpose of a derivative work presumes intent on the part of its creator. So too
for a different character that is intentional. As impressive as generative AI models are, it is not
claimed that they have anything akin to human intent with regard to the nature of the outputs
they produce. Therefore, these models cannot genuinely meet the Supreme Court’s definition of
a transformative work.

A second statutory change Congress should make is to specify that the threshold of “substantial
similarity” necessary to find that a work is derivative should be lower when the work is produced
by generative AI, if there is evidence that the original work was in the training dataset. This
makes sense because of the possibility that the presence of the original work among the
training materials indirectly contributed – by influencing the relationships learned by the neural
network – to the substantial similarity, rather than the similarity being mere coincidence.

A focus on the inputs to AI models is not a robust solution

Recall that machine learning, like human learning, does not depend on the retention of materials
in the training dataset. Once those materials are used to train the weights of the neural network,
they can be discarded without any degradation in the performance of a generative AI model.
While more permanent storage of training materials is common – whether by AI model builders
or dataset curators – and can be an independent basis for copyright infringement, this presents
a separate issue from whether the use of copyrighted material merely for training is, in and of
itself, copyright infringement. For training purposes, temporary copying is all that is
fundamentally necessary.

In pursuit of protecting one’s copyright, pointing to temporary copying by the developers of AI
models is a weak peg to hang one’s hat on. For one thing, it is not fundamentally different from
what humans do when they learn from examples. For convenience’s sake, humans often
intentionally make copies of the materials – music, art, articles – that they use as learning
examples. And even when they don’t, their computers make temporary copies of the songs,
images, and text they use to master a skill. Yet all of that copying is generally accepted as fair
use.

As the Electronic Frontier Foundation has pointed out:



“‘Temporary copying’ of data is fundamental to how computing works in general,
especially on the Internet. For example, … browser cache files are stored on servers to
speed up the loading of websites, and copies of visited pages are stored in a temporary
Internet files folder on your hard drive, speeding up the loading process for those
websites the next time you visit them.”[7]

In recognition of this fact, as EFF notes, U.S. courts have generally held that temporary copying
is either not subject to the Copyright Act or is fair use.

It is worth noting that developing a generative AI model doesn’t truly require even a temporarily
copied training dataset. If copying made the developers liable for infringement, they could
instead construct a training process that consisted of scrolling through the publicly available
training materials stored on the internet, rather than gathering those materials into a dataset.
While it would make for a slower training process, the point is that little more than trivial copying
is fundamentally necessary for training, such that those seeking copyright protection would be
advised to hang their hats elsewhere.
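
A minimal sketch of such a process follows, using hypothetical toy stand-ins for the model and the featurizer: each item is fetched into memory, used for a single weight update, and immediately discarded, so nothing beyond a transient copy ever exists on the developer’s side.

```python
import urllib.request

import numpy as np

class TinyModel:
    """Stand-in for a generative model: fixed-size weights, small incremental updates."""
    def __init__(self, dim: int = 64):
        self.weights = np.zeros(dim)

    def update(self, example: np.ndarray, lr: float = 0.01) -> None:
        self.weights += lr * (example - self.weights)   # toy running-average update

def featurize(raw: bytes, dim: int = 64) -> np.ndarray:
    """Toy featurizer: fold a byte histogram into a fixed-length vector."""
    counts = np.bincount(np.frombuffer(raw, dtype=np.uint8), minlength=256)
    return counts.reshape(dim, -1).sum(axis=1).astype(float)

def train_by_streaming(urls: list[str]) -> TinyModel:
    """Train without ever assembling a dataset: fetch, update, discard."""
    model = TinyModel()
    for url in urls:
        with urllib.request.urlopen(url) as response:   # transient, in-memory copy
            raw = response.read()
        model.update(featurize(raw))                    # one small weight update
        del raw                                         # the temporary copy is gone
    return model
```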

Another reason why a focus on the copying of training materials is a misguided strategy is that
it is not aimed at the real threat to copyright holders. Consider that prior to the emergence of
generative AI, simpler neural network models – typically classification and scoring models –
– were trained on large amounts of data (numerical, visual, acoustic, textual, and the like), some
portion of which was copyrighted. Yet there was little or no objection from the copyright holders.
The recent explosion in protests and lawsuits by copyright holders is motivated largely by the
fear that the outputs of generative AI will negatively affect the potential market for their
copyrighted works.

While that is a legitimate concern and is, in fact, the fourth factor in fair use analysis, it is a
concern focused entirely on the outputs of generative AI. It would be a concern to human
creators whether or not their works were used to train generative AI. To the extent that copyright
holders focus on the copying of copyrighted materials for training, they are flailing at a
peripheral issue and distracting policymakers and the courts from the real competitive threat
posed by the outputs of generative AI.

Other protection for training materials

Though the use of copyrighted materials to train AI models should be treated as fair use, there
are other means for allowing the owners of potential training materials to seek compensation or
proper attribution or to opt out of having their works used for training. Congress should work to
strengthen those means.

[7] https://www.eff.org/files/filenode/temporary_copies_fnl.pdf



Because the training materials used in building generative AI models come primarily from
internet-crawled material,[8] the most promising avenue for addressing the rights of the owners
is 1) better use of website terms and conditions agreements, including “clickwrap” agreements
that require site users to explicitly agree to the terms and conditions, which can include
respecting “Do Not Train” tags, and 2) stronger enforcement when those agreements are
breached. Congress should act to provide stronger enforcement mechanisms, as well as to
facilitate the use of collective licensing schemes.

[8] See https://www.brookings.edu/articles/the-politics-of-ai-chatgpt-and-political-bias/
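
On the technical side, honoring such signals is straightforward for a cooperating crawler. The sketch below is hypothetical: it checks a site’s robots.txt and a page-level opt-out meta tag before a page may be ingested for training, with the “noai” tag name chosen purely for illustration, since no single opt-out standard has yet settled.

```python
import urllib.robotparser
from html.parser import HTMLParser
from urllib.parse import urlsplit

class OptOutTagFinder(HTMLParser):
    """Detect a page-level opt-out, e.g. <meta name="robots" content="noai">."""
    def __init__(self):
        super().__init__()
        self.opted_out = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            if "noai" in (attrs.get("content") or "").lower():   # illustrative tag
                self.opted_out = True

def may_train_on(url: str, html: str, user_agent: str = "TrainingBot") -> bool:
    """Respect robots.txt and a per-page opt-out tag before ingesting a page."""
    parts = urlsplit(url)
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()                                  # fetch and parse robots.txt
    if not robots.can_fetch(user_agent, url):
        return False                               # site-level crawl restriction
    finder = OptOutTagFinder()
    finder.feed(html)
    return not finder.opted_out                    # page-level "Do Not Train" signal
```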

