“Stolen Books,” Bad Faith, and Fair Use
Cross-posted at Harvard’s Fair Use Week Blog.
Artificial intelligence is sure to be the hottest topic of this year’s Fair Use Week, and that hotness is well-deserved. It’s startling when a machine can instantly create written or visual works that would ordinarily require a skilled human writer or artist.
Fair use analysis is (famously) case-by-case, and the outcome of a fair use analysis for any particular AI technology will depend on how that technology works and (especially) the nature of its outputs and the purposes it serves. But we know from the Google Books and HathiTrust cases that some unlicensed computer processing of large datasets of in-copyright works is clearly fair use. Some AI technologies are sure to pass the fair use test from those cases, all else equal. But there is one interesting difference between HathiTrust and Google Books on one hand, and some of the AI tools being sued on the other: the books used in the former cases were lawfully owned by libraries and scanned with the libraries’ consent. It’s not clear that the AI companies have obtained all of their data with as clear a pedigree.
Indeed, one of the author class action lawsuits over AI argues that the datasets used to train some artificial intelligence tools are composed partly or entirely of material of apparently dubious origin. As The Verge reports, the plaintiffs claim that some of the AI training data “were acquired from ‘shadow library’ websites like Bibliotik, Library Genesis, Z-Library, and others, noting the books are ‘available in bulk via torrent systems.’” Does this matter for the fair use calculus? Should it?
If it does, the upshot could further curtail fair use in circumstances where it’s already constrained, tilting the balance of copyright further in favor of control by entertainment industries and against authors, the First Amendment, and the public interest.
The idea that “fair use presumes good faith and fair dealing” has a seemingly strong pedigree, with an endorsement from the Supreme Court in Harper & Row v. Nation, a case where the Court denied fair use in part because the user (The Nation magazine) had “scooped” the copyright holder by publishing key revelations from a forthcoming book before its release date, relying on a “purloined” copy of the manuscript.
That pedigree is demolished in “Bad Faith and Fair Use,” by Simon Frankel and Matt Kellogg, which shows methodically how every source of authority for a “good faith” requirement quoted in the Harper opinion is either misconstrued or itself without any basis in the law or history of copyright. The Harper court was simply wrong on the law. The article also explains how the Supreme Court’s subsequent fair use opinion in Campbell v. Acuff-Rose could be read to roll back Harper’s mistake at least partially, but bemoans that the Campbell court mostly dodged the issue.
Frankel and Kellogg also give a series of compelling policy arguments for keeping questions of “good faith,” including questions about how the user accessed the work they used, out of the fair use calculus:
The bad faith inquiry does not serve the central goal of copyright — to increase public access to new works — and in fact does much to impede this goal. It also needlessly confuses fair use with other areas of law, makes copyright litigation more costly and less predictable, and undermines copyright’s built-in First Amendment protections.
If you’re interested in this general question, it’s a must-read article.
Another must-read is Michael Carroll’s Copyright and the Progress of Science: Why Text and Data Mining Is Lawful. In addition to giving a detailed treatment of the fair use analysis of text and data mining, Carroll applies the arguments in Frankel & Kellogg’s piece directly to using a collection like Sci-Hub for computer analysis.
To Frankel & Kellogg’s and Carroll’s work, I want to add a further policy concern grounded in the emerging reality of 21st-century media distribution: copyright holders already exercise unprecedented non-copyright control over lawful access to their works.
As Aaron Perzanowski and Jason Schultz show at length in The End of Ownership, and public libraries have experienced in the broken market for ebooks, digital rentals and walled gardens are replacing old-fashioned ownership of copies, threatening the balance established by copyright. What the public can do with culture is increasingly dictated by vendors who can use technology and contracts to achieve what copyright expressly (and intentionally) does not: near total control over uses that benefit the public and do no harm to creativity.
The law gives copyright holders a panoply of remedies in cases where licenses are breached or digital protections circumvented. Those who break DRM (beyond what’s allowed by current DMCA rules) and breach licenses can be held directly liable for potentially significant damages, and varieties of secondary liability may attach to those who participate culpably in these actions. Fair use is not generally a defense to these claims (and is therefore already substantially diminished by them).
To treat a dataset’s origin as a barrier to fair use by third parties who are not themselves guilty of breaching a license or circumventing a digital lock, or even to give it significant weight as a factor in the fair use calculus, would be a further blow to the public interest in the copyright system. Copyright grants rights holders a measure of private control in order to encourage creativity, but it limits that control in the public interest. Turning one-sided contracts and technical protections into fair use nullifiers against the whole world (rather than just against licensees and circumventors) would enable copyright holders to exert even more control, even in situations where the public would be better served by fair use.