The short answer is that we don't know. The longer answer based purely on this case is that there's an argument that training is fair use and so copyleft doesn't have any impact on the model, but this is one case in California and doesn't inherently set precedent in the US in general and has no impact at all on legal interpretations in other countries.
The dearth of case law here still makes a negative outcome for FSF pretty dangerous, even if they don't appeal it and set precedent in higher courts. It might not be binding but every subsequent case will be able to cite it, potentially even in other common law countries that lack case law on the topic.
And then there is the chilling effect. If FSF can't enforce their license, who is going to sue to overturn the precedent? Large companies, publishers, and governments have mostly all done deals with the devil now. Joe Blow random developer is going to get a strip mall lawyer and overturn this? Seems unlikely.
No, it looks like the stance of the FSF is that models should be free as a matter of principle, the same as their stance when it comes to software. Nothing in the linked post contradicts the description that the judgement was that the training was fair use.
Where's the threat? The FSF was notified that as part of the settlement in Bartz v. Anthropic they were potentially entitled to money, but in this case the works in question were released under a license that allowed free duplication and distribution so no harm was caused. There's then a note that if the FSF had been involved in such a suit they'd insist on any settlement requiring that the trained model be released under a free license. But they weren't, and they're not.
(Edit: In the event of it being changed to match the actual article title, the current subject line for this thread is " FSF Threatens Anthropic over Infringed Copyright: Share Your LLMs Freel")
> but in this case the works in question were released under a license that allowed free duplication and distribution so no harm was caused.
FSF licenses contain attribution and copyleft clauses. It's "do whatever you want with it provided that you X, Y and Z". Just taking the first part without the second part is a breach of the license.
It's like renting a car without paying and then claiming "well you said I can drive around with it for the rest of the day, so where is the harm?" while conveniently ignoring the payment clause.
You may be confusing this with a "public domain" license.
If what you do with a copyrighted work is covered by fair use it doesn't matter what the license says - you can do it anyway. The GFDL imposes restrictions on distribution, not copying, so merely downloading a copy imposes no obligation on you and so isn't a copyright infringement either.
I used to be on the FSF board of directors. I have provided legal testimony regarding copyleft licenses. I am excruciatingly aware of the difference between a copyleft license and the public domain.
> I am excruciatingly aware of the difference between a copyleft license and the public domain.
Then why did you say "no harm was caused"? Clearly the harm of "using our copylefted work to create proprietary software" was caused. Do you just mean economic harm? If so, I think that's where the parent comment's confusion originates.
> The GFDL imposes restrictions on distribution, not copying, so merely downloading a copy imposes no obligation on you and so isn't a copyright infringement either.
The restrictions fall not only on verbatim distribution, but on derivative works too. I am not aware whether model outputs are settled to be or not to be (hehe) derivative works in a court of law, but that question is at the very least very much valid.
> the district court ruled that using the books to train LLMs was fair use but left for trial the question of whether downloading them for this purpose was legal.
The pipeline is something like: download material -> store material -> train models on material -> store models trained on material -> serve output generated from models.
These questions focus on the inputs to the model training; the question I have raised focuses on the outputs of the model. If [certain] outputs are considered derivative works of input material, then we have a cascade of questions about which parts of the pipeline are covered by the license requirements. Even if any of the upstream parts of this simplified pipeline are considered legal, it does not imply that the rest of the pipeline is compliant.
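As a rough sketch of that point: each stage of the simplified pipeline gets its own status, and clearing one stage says nothing about the others. The statuses below are placeholders based only on what has been said in this thread, not legal conclusions, and the names (`Stage`, `PIPELINE`, etc.) are made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PERMITTED = "permitted"
    UNDECIDED = "undecided"


@dataclass(frozen=True)
class Stage:
    name: str
    status: Status
    basis: str


# Illustrative placeholders only: statuses reflect claims made in this thread.
PIPELINE = [
    Stage("download material", Status.UNDECIDED, "left for trial"),
    Stage("store material", Status.UNDECIDED, "not addressed in the thread"),
    Stage("train models on material", Status.PERMITTED, "district court: fair use"),
    Stage("store trained models", Status.UNDECIDED, "not addressed in the thread"),
    Stage("serve generated output", Status.UNDECIDED, "derivative-work question open"),
]


def open_questions(pipeline):
    """Return the names of stages that still need their own analysis."""
    return [s.name for s in pipeline if s.status is not Status.PERMITTED]


def whole_pipeline_cleared(pipeline):
    """Every stage is judged separately; one cleared stage clears nothing else."""
    return all(s.status is Status.PERMITTED for s in pipeline)


if __name__ == "__main__":
    print("cleared end to end:", whole_pipeline_cleared(PIPELINE))
    for name in open_questions(PIPELINE):
        print("still open:", name)
```

The only design choice here is that the overall check is an `all()` over the stages, which is just the "upstream legal does not imply downstream compliant" argument written down.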
Consider the net effect and the answer is clear. When these models are properly "trained", are people going to look for the book or a derivative of it, with proper attribution?
Or is the LLM going to regurgitate the same content with zero attribution, and shift all the traffic away from the original work?
When viewed in this frame, it is obvious that the work is derivative and then some.
That is your opinion, but the judge disagreed with you. The decision may yet be overturned on appeal, but as it stands, in that courtroom, the training was fair use.
I can memorize a song and it will be fair use too, but it won't be anymore once I start performing it publicly. Training itself is quite obviously fair use, what matters is what happens next.
This is also, unfortunately, the only way this can be settled. Making LLM output legally a derivative work would murder the AI gold rush and nobody wants that.
I'm also skeptical that it's impossible to get an LLM to reproduce some code verbatim. Google had that paper a while back about getting diffusion models to spit out images that were essentially raw training data, and I wouldn't be surprised if the same is possible for LLMs.
Stack Overflow has verbatim copied GPL code in some of its questions and answers. As presented by SO, that code is not under the GPL license (this also applies to other licenses - the BSD advertising clause and the original json will cause similar problems).
Arguably, the use of the code in the Stack Overflow question and answer is fair use.
The problem occurs not when someone reads the Q&A with the improperly licensed code but rather when they then copy that code verbatim into their own non GPL product and distribute that without adherence to the GPL.
It's the last step - some human distributing the improperly licensed software that is the violation of the GPL.
This same chain of what is allowed and what is not is equally applicable to LLMs. Providing examples from GPL licensed material to answer a question isn't a license violation. The human copying that code (from any source) and pasting it into their own software is a license violation.
---
Some while back I had a discussion with a Swiss developer about the indefinite article used before "hobbit" in a text game. They used "an hobbit" and in the discussion of fixing it, I quoted the first line of The Hobbit. "In a hole in the ground there lived a hobbit." That cleared it up and my use of it in that (and this) discussion is fair use.
If someone listening to that conversation (or reading this one) thought that the bit that I quoted would be great on a T-shirt and then printed that up and distributed it - that would be a copyright violation.
The Ninth Circuit did, however, overturn the district court's decision that Google's thumbnail images were unauthorized and infringing copies of Perfect 10's original images. Google claimed that these images constituted fair use, and the circuit court agreed. This was because they were "highly transformative."
If I was to then take those thumbnails from a google image search and distribute that as an icon library, I would then be guilty of copyright infringement.
I believe that Stack Overflow, Google Images, and LLMs and their output are examples of transformative fair use. What someone does with that output is where copyright infringement happens.
My claim isn't that AI vendors are blameless but rather that in the issue of copyright and license adherence it is the human in the process that is the one who has agency and needs to follow copyright (and for AI agents that were unleashed without oversight, it is the human that spun them up or unleashed them).
That's really interesting. I'm a lawyer, and I had always interpreted the license like a ToS between the developers. That (in my mind) meant that the license could impose arbitrary limitations above the default common law and statutory rules and that once you touched the code you were pregnant with those limitations, but this does make sense. TIL. So, thanks.
Licenses != contracts, and well, the FSF's position has always been that the GPL isn't a contract, and contracts are what allow you to impose arbitrary limitations. Most EULAs are actually contracts.
Does the reasoning in the cases where people to whom GPL software was distributed could sue the distributor for source code, rather than relying on the copyright holder suing for breach of copyright, strengthen the argument that arbitrary limitations are enforceable?
Unrelated question regarding this part, since you seem to be an expert on this:
> If what you do with a copyrighted work is covered by fair use it doesn't matter what the license says - you can do it anyway.
How is it that contracts can prohibit trial by jury but they can't prohibit fair use of copyrighted work? Is there a list of things a contract is and isn't allowed to prohibit, and explanations/reasons for them?
The general answer is because there is a statute or court opinion that says so for one thing and a different one that says something else for the other thing.
It's also relevant that copyright (and fair use) is federal law, contracts are state law and federal law preempts state law.
It sounds that way a bit from the one sentence. But that’s not the case at all.
> 4. MODIFICATIONS
> You may copy and distribute a Modified Version of the Document under the conditions of sections 2 and 3 above, provided that you release the Modified Version under precisely this License, with the Modified Version filling the role of the Document, thus licensing distribution and modification of the Modified Version to whoever possesses a copy of it. In addition, you must do these things in the Modified Version:
Etc etc.
In short, it is a copyleft license. You must also license derivative works under this license.
Just fyi, the gnu fdl is (unsurprisingly) available for free online - so if you want to know what it says, you can read it!
And the judgement said that the training was fair use, but that the duplication might be an infringement. The GFDL doesn't restrict duplication, only distribution, so if training on GFDLed material is fair use and not the creation of a derivative work then there's no damage.
Right. I can publish the work in whole without asking permission. That’s unrestricted duplication.
However, as i read it, an LLM spitting out snippets from the text is not “duplicating” the work. That would fall under modifications. From the license:
> A "Modified Version" of the Document means any work containing the Document or a portion of it, either copied verbatim, or with modifications and/or translated into another language.
I read that pretty clearly as any work containing text from a gnu fdl document is a modification not a duplication.
1) Obtaining the copyrighted works used for training. Anthropic did this without asking for the copyright holders' permission, which would be a copyright violation for any work that isn't under a license that grants permission to duplicate. The GFDL does, so no issue here.
2) Training the model. The case held that this was fair use, so no issue here.
3) Whether the output is a derivative work. If so then you get to figure out how the GFDL applies to the output, but to the best of my knowledge the case didn't ask this question so we don't know.
For this to stand up in court you'd need to show that an LLM is distributing "a modified version of the document".
If I took a book and cut it up into individual words (or partial words even), and then used some of the words with words from every other book to write a new book, it'd be hard to argue that I'm really "distributing the first book", even if the subject of my book is the same as the first one.
This really just highlights how the law is a long way behind what's achievable with modern computing power.
You’re just describing transformative use. I’m not a lawyer, but an example from music - taking a single drum hit from a James Brown song is apparently not transformative. Taking a vibe from another song is also maybe not transformative, e.g. Robin Thicke and Pharrell’s “Blurred Lines” was found in court to have taken the “feel” from Marvin Gaye’s “Got to Give It Up”.
Which is all to say that the law is actually really bad at determining what is right and wrong, and our moral compasses should not defer to the law. Unfortunately, moral compasses are often skewed by money - like how normal compasses are skewed by magnets.
Presumably, a suitable prompt could get the LLM to produce whole sections of the book which would demonstrate that the LLM contains a modified version.
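Checking for that kind of verbatim reproduction is mechanically simple, for what it's worth. A minimal sketch, assuming you already have the source text and the model output as plain strings (the function name and the word-level comparison are my choices here, not anything from the case): find the longest run of consecutive words the two share.

```python
def longest_shared_run(source: str, output: str) -> list[str]:
    """Return the longest run of consecutive words that appears in both texts."""
    a = source.split()
    b = output.split()
    best_len, best_end = 0, 0
    # prev[j] holds the length of the shared run ending at a[i-2], b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


if __name__ == "__main__":
    source = "the quick brown fox jumps over the lazy dog"
    output = "a quick brown fox jumps high"
    print(" ".join(longest_shared_run(source, output)))
```

How long a shared run has to be before it matters legally is exactly the part this can't answer; it only shows that "does the output contain a portion of the document" is a measurable question.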
I am distributing an svg file. It’s a program that, when run, produces an image of mickey mouse.
By your description of the law, this svg file is not infringing on disney’s copyright - since it’s a program that when run creates an infringing document (the rasterized pixels of mickey mouse) but it is not an infringing document itself.
I really don’t think my “i wrote a program in the svg language” defense would hold up in court. But i wonder how many levels of abstraction before it’s legal? Like if i write the mickey-mouse-generator in python does that make it legal? If it generates a variety of randomized images of mickey mouse, is that legal? If it uses statistical analysis of many drawings of mickey to generate an average mickey mouse, is that legal? Does it have to generate different characters if asked before it is legal? Can that be an if statement or does it have to use statistical calculations to decide what character i want?
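To make the first two rungs of that ladder concrete, here's a small sketch with a plain circle standing in for the character (everything in it, `STATIC_SVG` and `generate_svg` included, is made up for illustration): a hand-written svg file, and a python "generator" that emits the identical bytes when given the right parameters.

```python
# Rung 1: the image written out directly as an SVG document.
STATIC_SVG = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">'
    '<circle cx="50" cy="50" r="40"/>'
    '</svg>'
)


def generate_svg(cx: int, cy: int, r: int, size: int = 100) -> str:
    """Rung 2: a program that builds the SVG document from parameters."""
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">'
        f'<circle cx="{cx}" cy="{cy}" r="{r}"/>'
        f'</svg>'
    )


if __name__ == "__main__":
    # With these arguments the generator's output is the static file.
    print(generate_svg(50, 50, 40) == STATIC_SVG)
```

The output is the same document either way, which is why the question is about where along the ladder the analysis changes, not about this step.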
I don't like the editorialized title either but I would say that the actual post title
"The FSF doesn't usually sue for copyright infringement, but when we do, we settle for freedom"
and this sentence at the end
" We are a small organization with limited resources and we have to pick our battles, but if the FSF were to participate in a lawsuit such as Bartz v. Anthropic and find our copyright and license violated, we would certainly request user freedom as compensation."
Is it? The FSF's description of the judgement is that the training was fair use, but that the actual downloading of the material may have been a copyright infringement. What software does the FSF hold copyright to that can't be downloaded freely? Under what circumstances would the FSF be in a position to influence the nature of a settlement if they weren't harmed?
Copyright infringement causes harm, so if there's no harm there's no infringement. You can freely duplicate GFDLed material, so downloading it isn't an infringement. If training a model on that downloaded material is fair use then there's no infringement.
If it's pretty fucking simple, can you point to the statement in the linked post that supports this assertion? What it says is "According to the notice, the district court ruled that using the books to train LLMs was fair use", and while I accept that this doesn't mean the same would be true for software, I don't see anything in the FSF's post that contradicts the idea that training on GPLed software would also be fair use. I'm not passing a value judgement here, I'm a former board member of the FSF and I strongly believe in the value and effectiveness of copyleft licenses, I'm just asking how you get from what's in the post to such an absolute assertion.
what I keep wondering is what kind of laws will be rendered useless with the precedent they'll cause. Can this be beginning of the end of copyright and intellectual property?
Copyright, possibly. Intellectual property more broadly, no. AI has 0 impact on trademark law, quite clearly (which is anchored in consumer protection, in principle). Patent law is perhaps more related, but it's still pretty far.
Why are you wondering? Any law that limits the ability of capital owners to extract wealth will be overturned, and not just from AI, that's global in every industry everywhere there are humans.
Doubt it. I'm sure it will have an exclusion where for example using genAI to train on or replicate leaked or reverse-engineered Windows code will constitute copyright infringement, but doing the same for copyleft will be allowed. Always in favor of corporate interests.
And you're stuck with whatever fucked up kernel the vendor gave you, assuming they even followed their obligations and gave you access to the source. The vast majority of x86 systems run mainline kernels because there's a sufficient level of abstraction. The number of Arm devices that's true for is a tiny percentage of the Arm devices out there running Linux.
What? You can build an entirely free UEFI. ACPI has a free compiler and a free interpreter. Neither implies or requires the existence of non-free blobs, and neither implies or requires any code running in a more privileged environment than the OS.
I've a bunch of devices running coreboot with a Tianocore payload, but they're largely either very weird and now unavailable or I haven't upstreamed them, so that's not super helpful. Still, it's absolutely not impossible, and you can certainly buy Librebooted devices.
Let's say some hw manufacturer would open-source the required specs to implement it on its chips. (Very unlikely, but let's say they do...) So what? Dangerous capabilities remain.
Until UEFI and secure boot, SMM would run code provided by the BIOS. BIOS was updatable, moddable, replaceable. See coreboot and numerous BIOS mods such as wifi whitelist removal.
Trustzone usually runs code from eMMC. These chips are programmed at the factory with a secret key in the RPMB partition. It's a one-time operation - the user can't replace it. Without that key you can't update the code Trustzone executes. Only the manufacturer can update it.
Also, any ring -2 code can be used for secure boot locking the device to manufacturer approved OS, enforce DRM, lock hardware upgrades and repairs, spy, call home, install trojans by remote commands, you name it. And you can't audit what it does.
To respond in more detail: secure boot (as in the UEFI specification) does nothing to prevent a user from modifying their system firmware. Intel's Boot Guard and AMD's Platform Secure Boot do, to varying degrees of effectiveness, but they're not part of the UEFI spec and are not UEFI specific. I have replaced UEFI firmware on several systems with Coreboot (including having ported Coreboot to that hardware myself), I am extremely familiar with what's possible here.
> Trustzone usually runs code from eMMC.
This might be true in so far as the largest number of systems using Trustzone may be using eMMC, but there's nothing magical about eMMC here (my phone, which absolutely uses Trustzone, has no eMMC). But when you then go on to say:
> Without that key you can't update the code Trustzone executes. Only the manufacturer can update it.
you're describing the same sort of limitation that you decried with SMM. As commonly deployed, Trustzone is strictly worse for user freedom than SMM is. This isn't an advantage for Arm.
> Also, any ring -2 code can be used for secure boot locking the device to manufacturer approved OS
No, the secure boot code that implements cryptographic validation of the OS is typically running in an entirely normal CPU mode.
> enforce DRM
This is more typical, but only on Arm - on x86 it's typically running on the GPU in a more convoluted way.
> lock hardware upgrades and repairs
Typically no, because there's no need at all to do any sort of hardware binding at that level - you can implement it more easily in normal code, why make it harder?
> spy
When you're saying "can be used", what do you mean here? Code running in any execution environment is able to spy.
> call home
Code in SMM or Trustzone? That isn't literally impossible but it would be far from trivial, and I don't think we've seen examples of it that don't also involve OS-level components.
> install trojans by remote commands
Again, without OS support, I'm calling absolute bullshit on this. You're going to have an SMM trap on every network packet to check whether it's a remote command? You're going to understand a journaling filesystem and modify it in a way that remains consistent with whatever's in cache? This would be an absolute nightmare to implement in a reliable way.
> And you can't audit what it does.
Trustzone blobs do have a nasty habit of being encrypted, but SMM is just… sitting there. You can pull it out of your firmware. It's plain x86, an extremely well understood architecture with amazing reverse engineering tools. You can absolutely audit it, and in many ways it's easier to notice backdoors in binary than it is in source.
Trustzone is mostly deployed on Devicetree-based platforms. What saves you here isn't the choice of firmware interface, it's whether the platform depends on hostile code. If you don't care about secure boot (or if you do but don't care about updating the validation keys at runtime), you can implement a functional UEFI/ACPI platform on x86 with zero SMM.
I appreciate your detailed reply. I think we are looking from different perspectives. You are correct in an item-by-item way, but you need to put them all together and see the bigger picture. In my comment, I may have made a mess confusing technologies and their capabilities, but I was looking at the forest, not the trees.
There are only two viable firmware alternatives in the world right now: ring 0 U-boot* or the ones that use ring -2: UEFI* and various bootloaders +TrustZone in Android world (read the footnotes!). Manufacturers usually focus on only one of the two: either ring -2 (locked bootloaders, UEFI +ACPI +SMM +whatever crapware they may want to add) protected by secure boot or ring 0 U-boot +a device tree +their GPL source code. The ones interested in locked-down platforms choose the ring -2 option and they are not going to make it open source, nor provide the signing keys to allow it to be replaced by FOSS alternatives.
I appreciate freedom. Linux kernel is free (ring 0). U-boot and coreboot are free (ring -2 if they include ACPI / SMM, else still ring 0). When I run a Linux kernel, I don't want it preempted and sabotaged by a ring -2 component. If that ring -2 includes proprietary blobs, then it's a hard "no" from me. You may argue that SMM (and ACPI) brings useful features such as overheating shutdown when the kernel froze/crashed or the system is stuck at bootloader, but let's face it: practically there's no free alternative to manufacturer's blobs when it comes to ring -2. The FOSS community barely keeps u-boot and the device tree working. Barely! An open source UEFI + all that complexity for every single board out there is a no-go from the start. If you ported Coreboot, i'm sure you know how difficult it is.
I recently learned that ACPI can be decompiled to source code, so that's an improvement, but not by much. Unlike a device tree, which is only a hardware description, ACPI is executable code. I see that as a risk and I'm not the only one. Even Linus had something to say about it - the quote is in the Wikipedia article. Some of that code executes in ring -2. It can also install components in the OS - spyware components - you can also read about that in the Wikipedia article. U-boot has the capability of creating files on some filesystems and you can argue that a proprietary fork could maliciously install OS components by dropping something in init.d, but I've never heard of it being misused that way, and a manufacturer must publish the GPL source code, so it would be difficult to hide. A device tree can't do that at all. If you use UEFI, then every single blob published by the manufacturer must be decompiled and be inspected. U-boot + ACPI is probably simpler than porting Coreboot, but it still won't happen. There are simply too many systems to support.
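For anyone who wants to look at the tables themselves before decompiling anything, here's a minimal sketch that reads the standard 36-byte header every ACPI system description table starts with and verifies the checksum (all bytes of the table sum to zero mod 256). The `AcpiHeader` and `parse_acpi_header` names are mine; the path in the usage note is where Linux typically exposes the raw tables, and reading it normally needs root.

```python
import struct
from typing import NamedTuple

HEADER_FORMAT = "<4sIBB6s8sI4sI"  # little-endian, 36 bytes in total
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


class AcpiHeader(NamedTuple):
    signature: str
    length: int
    revision: int
    checksum_ok: bool
    oem_id: str
    oem_table_id: str


def parse_acpi_header(table: bytes) -> AcpiHeader:
    """Parse the common ACPI table header and verify the table checksum."""
    if len(table) < HEADER_SIZE:
        raise ValueError("table is shorter than the 36-byte ACPI header")
    (sig, length, revision, _checksum, oem_id, oem_table_id,
     _oem_rev, _creator_id, _creator_rev) = struct.unpack_from(HEADER_FORMAT, table)
    # The checksum byte is chosen so that the whole table sums to 0 mod 256.
    checksum_ok = len(table) >= length and sum(table[:length]) % 256 == 0
    return AcpiHeader(
        signature=sig.decode("ascii", "replace"),
        length=length,
        revision=revision,
        checksum_ok=checksum_ok,
        oem_id=oem_id.decode("ascii", "replace").rstrip(),
        oem_table_id=oem_table_id.decode("ascii", "replace").rstrip(),
    )
```

Usage would be something like `parse_acpi_header(open("/sys/firmware/acpi/tables/DSDT", "rb").read())`. Everything after the header in a DSDT is AML bytecode, which is what ACPICA's `iasl -d` disassembles.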
So, as a conclusion. I see ring -2 as a dangerous capability (even if the malware itself doesn't execute in ring -2) because there are no viable open source alternatives. For this reason I encourage you to not support or promote UEFI and ring -2.
> Trustzone is strictly worse for user freedom than SMM is. This isn't an advantage for Arm.
> Trustzone is mostly deployed on Devicetree-based platform.
True, but ARM world still has unlocked CPUs that can boot unsigned firmware. There are none left in x86 world. (Or at least none that I know about.)
> No, the secure boot code that implements cryptographic validation of the OS is typically running in an entirely normal CPU mode.
OK, valid observation, I may have used "ring -2" to describe features that are not typically running in ring -2. I tried to avoid these technologies as much as possible and I don't have much hands-on experience about what runs where.
> you can implement a functional UEFI/ACPI platform on x86 with zero SMM.
One dev could probably implement and maintain one or maybe 5-10 systems if they are related (same CPU, mostly same hardware). How many systems are there and how many devs? Not possible, but for very very few exceptions, as long as some random dev got one of these systems for himself and does it as a pet project.
----
* When I say U-boot, I mean mainline U-boot plus a device tree, or forks with published GPL source code. I know U-boot can include ACPI and secure boot, but that's not what I mean in the context of this comment. Sure, you can set up secure boot with open source U-boot if you want to. There's nothing wrong with that.
* When I say UEFI, I mean all related technologies: ACPI, SMM, secure boot, signed firmware, etc. The whole forest.
There's a standard like ACPI for Arm devices - it's called ACPI, and it's a requirement for the non-Devicetree SystemReady spec (https://www.arm.com/architecture/system-architectures/system...). But it doesn't describe the huge range of weirdness that exists in the more embedded end of the Arm world, and it's functionally impossible for it to do so while Arm vendors see devices as an integrated whole rather than a general purpose device that can have commodity operating systems installed on them.
Fi launched with Sprint and T-Mobile roaming and added US Cellular, but is presently T-Mobile only. I don't think AT&T has ever been a supported carrier.
The article makes clear that the orientation of the lettering has changed over time, which counts against the idea that what it is now necessarily reflects the original intent.
To me the evidence in the article still suggests that “hard correctness” is probably not historically appropriate…hand lettering is not a typeface.
That’s really where I am coming from — the perspective of historical architecture, historical architectural practice, and historical methods of delivering buildings.
In particular, today’s mythological Wright is not the historical Wright of 1908 on a commercial jobsite. And the contractual relationships of a 1908 construction project were not delineated like current construction projects.
And yet the article shows the original sketches Wright made for the building that show the asymmetrical H's with the bars aligned with the bars on the E's (i.e. on the upper half) in virtually identical lettering to what was eventually installed.
I don't really see how you can come away with the conclusion that this suggests lack of intent; at most, it seems like you had already formed the opinion that there was no intent, and you didn't find the evidence to the contrary convincing enough that you were wrong. I don't think your take is necessarily wrong, but I don't think it's fair to characterize the evidence as suggesting what you're saying.