You guys are acting like GPT-4 has a reasonable-fidelity copy of everything on the Internet. It’s estimated to have 1 trillion parameters. Assuming it’s using 16-bit floating point, each parameter is 2 bytes, so it has 2 trillion bytes of information.
The Web is estimated to have 1200 zettabytes. So we cram 1200 zettabytes down through exabytes, petabytes, and finally into our 2 terabytes of parameters. That’s a ratio of roughly 600 billion to one, so an LLM has got a very fuzzy copy at less than a billionth of the resolution of the original.
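For anyone who wants to check that arithmetic, here is a quick sketch in Python. The inputs are just the estimates quoted above (1 trillion parameters, 16-bit floats, 1200 zettabytes for the Web), not measured values, and the variable names are made up for illustration.

```python
# Back-of-the-envelope check using the estimates quoted in the post.
# All three inputs are assumptions, not measurements.
PARAMS = 10**12             # ~1 trillion parameters (estimate)
BYTES_PER_PARAM = 2         # 16-bit floating point = 2 bytes
WEB_BYTES = 1200 * 10**21   # 1200 zettabytes, with 1 ZB = 10^21 bytes

model_bytes = PARAMS * BYTES_PER_PARAM   # 2 * 10^12 bytes, i.e. 2 TB
ratio = WEB_BYTES // model_bytes         # Web bytes per byte of model

print(f"model size: {model_bytes / 10**12:.0f} TB")
print(f"Web bytes per model byte: about {ratio:.0e}")
```

With those inputs the ratio works out to 6 × 10^11, so the "fuzzy copy" would be far coarser than a billionth of the original.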
I think it does? I think it plays into whether it is actually infringing on the rights of the author by making it so that someone doesn’t need to pay them for the work.
One of the factors in determining fair use is exactly that: whether redistribution of a work directly competes with the author themselves.
In the case described here, it’s not really an effective competition. Literally no one would intentionally use one of these models in order to recreate one of the training images.
That’s one of the things that makes this weird, in that the argument about copyright with these AI models actually DOES hinge upon the idea of competing with the original authors. However, it’s a very weird case compared to prior copyright cases, since it’s not competing with the original copyrighted work. It’s competing with future, uncreated works, and it’s not doing so with works that violate copyright. It’s creating novel outputs that themselves couldn’t reasonably be described as a violation of copyright.
It’s almost suggesting that the model has somehow copied the artist’s ability to create works (which in some sense it has) by observing the past work, and that the artist had a copyright over that ability. That notion of copyright would be very nonstandard.
The more feasible copyright argument, I think, is that the model somehow contains the copyrighted work itself, in a form that does not constitute a transformative work. That would fit within traditional copyright law, but I think it’s a hard argument to make. Probably the most likely argument to win though.
I guess that’s what I was trying to get at, in a way. Even if I don’t use the copyrighted works, by purchasing their product/service, I’m still purchasing something that contains them. Maybe not in some very specific technical definition, perhaps, but practically, functionally it does, right?
And, to be clear, I’m not arguing that it is copyright infringement. But it certainly doesn’t seem like a stretch if a judge were to decide it is.
Kind of, but simply containing a copyrighted work in some way doesn’t necessarily constitute copyright infringement. Again, this is the fundamental idea of fair use. Some uses of copyrighted work are fine, and don’t constitute infringement.
Some major components of deciding that are the effect of the new work on the market or value of the original work, and whether the new work is transformative.
The first part is what I was alluding to. It’s kind of weird in that the model’s ability to create the actual copyrighted work doesn’t actually impact the value of the original work in this case. No one would use it for that, so it has no impact at all. It can be argued that these models do have some impact on the value of future works, essentially devaluing the author’s skills themselves, but this is a kind of novel argument (to some degree because humans have never developed technology capable of anything like this before).
The transformative aspect is a major hurdle for a copyright argument against these models though. While the original copyrighted works might be considered as being in there in some way, I feel like it is hard to argue that they haven’t been transformed to a really profound degree. There’s really no part of these models that represents any singular part of their training data. You just have a lot of parameters which have been adjusted through exposure to the data. To this end, the model doesn’t really contain the training data itself, but rather the memory of having been exposed to it. It’s a weird, perhaps subtle difference that’s hard to understand.
To me, it’s hard to look at any individual part of the training data, or even the training data in its entirety, and then look at the model and not consider it a profoundly different thing. That makes for a strong argument that it is fair use, by being such a profoundly transformative work.
The whole point of machine learning (and science) is to find regularities in the data and compress them. For example, you could compress this thread down to about 5% of its existing text if we removed all the times we’ve re-litigated the same topics without any new data.
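To make the "regularities compress" point concrete, here is a small sketch using Python's standard zlib module. The repeated sentence is a made-up stand-in for a thread that keeps circling back, not actual thread text, and a general-purpose compressor is only a loose analogy for what a model does during training.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical stand-in for a thread that keeps repeating itself.
repetitive = b"We have re-litigated this point already. " * 200
# Same length, but with no regularities for the compressor to find.
noise = bytes(rng.getrandbits(8) for _ in range(len(repetitive)))

def compressed_fraction(data: bytes) -> float:
    """Return compressed size divided by original size."""
    return len(zlib.compress(data, 9)) / len(data)

repetitive_fraction = compressed_fraction(repetitive)
noise_fraction = compressed_fraction(noise)

print(f"repetitive text: {repetitive_fraction:.1%} of original size")
print(f"random bytes:    {noise_fraction:.1%} of original size")
```

The repetitive input shrinks to a tiny fraction of its size, while the random bytes barely shrink at all, since there is nothing regular in them to exploit.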
I’m pretty damn certain I just encountered a fake true crime YouTube video about a murder that did not happen. All pretty much AI generated. Reported it to YouTube.