Whose rights are they anyway?

LLMs, models and the datasets they train on...

Aug 22, 2024

Back in 2008 or late 2007 (I don’t remember which, and I don’t have my email archive), I was contacted by a new organization calling itself “Common Crawl”. I was Google’s open source lead, and unusual, ‘open source’-adjacent organizations like this would often reach out to me.

I had joined, and donated Google’s money to, a large number of open source and free software organizations. The organization's founder, Gil, had come out of Applied Semantics (which had been acquired by Google and made into AdSense). He had left Google in 2007, and knew that Google had most of the crawls dating back to the company's founding.

I was always a pretty soft touch for ex-Googlers looking to improve the web, and I *think* we gave them maybe 25 or 50k. But they really were more interested in the crawl. I told them I’d get back to them after having a chance to think it through.

What is a crawl? It’s simply all those text files and images and, later, video that make up what you think of as the Web. Google crawls the web, tries to keep its copy up to date, and then indexes it for search and ads. The most recent Common Crawl is about 70 TB of compressed data, and they’d be the first to tell you that’s hardly an exhaustive crawl.

Anyhow, you should think of Google as one big ‘fair use engine’: it takes information it does not own or have rights to, learns from it, and presents it to the Google user.

Google would be sued over this and would fight in the courts to determine whether that use is fair use or not. The ‘Perfect 10’ lawsuits over image thumbnails, the Oracle lawsuit around APIs and their copyrightability, and the book scanning and Viacom lawsuits all came down to the question of ‘what is okay to do with the information you based or trained your systems on’.

Back to Common Crawl: in speaking with our remarkable legal counsel, I determined that the crawls were simply not ours to give, as redistribution would push the boundaries of fair use well beyond what we thought would be practical or defensible. I remember the folks at Common Crawl being disappointed but not surprised.

Fast forward to 2024. Artificially intelligent systems start with the Common Crawl, add other datasets, train their large language models on the breadth and depth of human expression, and then release a product, a trained model, or access to the same. I would assert this is an absolutely fine fair use of that data, but I acknowledge that it will take a court battle to confirm that. Companies like OpenAI will likely be fighting that lawsuit for at least 10 years. To expose my bias, I think that having LLMs learn from datasets they do not own is not only fair use, but a singularly important and righteous act to move society forward. That’s a topic for another article, I think.

Wait, what is the topic of this article then? Well, I would assert that the current approaches to ‘open source AI’, being so focused on copyright, fail to take into account that if you think you can apply copyright to the weights and measures that come out of an LLM’s training stage, then you can’t just assume the rights from all the source data. I would assert that the weights that are generated by training an LLM are not copyrightable at all, and should not be treated as such.

Are they a new kind of intellectual property? Maybe? I don’t think so, but many might disagree with me. I think they are more akin to a recipe: a collection of facts that is inherently not seen as protected by copyright or by other intellectual property regimes like patent or trade dress.

To give an anodyne, and maybe a little tired, example: the ‘formula’ for Coca-Cola is famously guarded, but if it were to leak from the company, that ‘trade secret’ would no longer be protectable. That’s not to say someone could steal it and use it; that would still be theft. But if a company were to reverse engineer the formula, they could go and sell their soda pop without Coca-Cola being able to stop them. This is why Coke spends so much on trade dress, trademark and other actions, so that were a competitor to release ‘Just Like Coke’ in a Coke-looking bottle using the Coke colors or something, they would find themselves in a very costly legal battle.

But there are tons of ways to compete against Coke, by being as good but cheaper, or better with your new formula, etc., and a careful marketer could do this and be quite resilient against even an aggressive lawsuit. I’m not stanning for RC Cola or Pepsi here, mind you, so please keep the cola wars out of the comments.

Anyhow, back to LLMs. I assert that the alphanumerical output of the training stage of these LLMs is, in fact, no different than a recipe for goulash or Coke or a whiskey sour, and to release it to the public is to literally give up control of its contents. So when you see a Terms of Service around an LLM that says ‘don’t use this in this manner or that’, I would assert that those terms mean less than nothing.

If you feel they can and should be applicable, I’ll put it this way: if you think that the producers of LLMs can release the weights as if they are protected intellectual property, then you must also think that they must have the rights over the source data to do so. I disagree with that. I acknowledge that this is a fight the courts will settle after a decade or so of expensive legal work by the large tech companies, likely with the assorted publishers and news organizations on the other side of the table.

Please note, before you react to this, I am not saying that copyright infringement using these AI tools is okay. If you use a tool like Mistral or ChatGPT to create ‘Shake It Off’, you’re still stealing from Taylor Swift. But if you use ChatGPT to help you write a poem in the style of Bob Dylan, then that’s just inspiration, and no different than if I were to listen to the assembled works of the Beatles and produce an album that honestly sounds like, and could be from, them, even though it’s a completely separate work.

What I am saying is that basing ideas of open source AI in copyright is the wrong direction. I think the open source definition is a remarkable, inspiring document, and I have a history of putting open source licenses on non-copyrightable works (Sitemaps comes to mind, which I applied the Apache license to, pre-Creative Commons) to signal intention and, well, emotion, but that was never going to hold up in court. Indeed, when Oracle was granted the remarkable ruling that APIs were copyrightable (I still disagree with the courts on that one), the Supreme Court decided that even if they were copyrightable, the use of them is fair use, lest modern industry devolve into chaos (my words, not theirs).

If you disagree, I’d assert you need to show me that you’ve acquired the rights from the producers of all that content you’ve pulled into your training dataset before trying to release under a controlling regime. Alphabet has gone so far as to license data from Reddit, which I think is breathtakingly shortsighted, but maybe the contract is written in such a way that they’re not actually doing that? Time will tell.

If you lament this line of thinking, I’d assert that you should be careful what you wish for.

Imagine if you read the classic tome ‘The C Programming Language’ and had to send a dime to Kernighan and Ritchie every time you compiled a program. Or imagine reading Newton’s Principia and, every time you pressed the brake pedal on a car, needing to give credit or send a dime to Newton's estate in Cambridge somewhere. The latter would be ridiculous, of course; it’s been centuries! But the first edition of the C book is now going on 46 years old. C would not have grown the way it did if it had not been made available the way it was.

Imagine if George R.R. Martin had to send a dime to the Tolkien estate every time a dragon breathes fire. Or if the authors of The Expanse had to send a quarter to George Lucas every time their ships come under fire. Or if Lucas had to send 5 cents to Sergio Leone or Akira Kurosawa whenever someone fights in a cantina. Or if Tim Cook had to ship a dollar to Xerox PARC every time Apple ships an overpriced computer to a user. I think you get my drift here.

There are those who would embrace such structures and have tried for centuries to monopolize thought and ideas and entrench the powerful. Plenty of writers who are stronger and smarter than I am (Hi Cory!) have argued in this space, so go read them! There is a spectrum of thinking, from complete freedom to copy and redistribute information, programs, movies, etc., to something more measured, to complete copyright/IP maximalism and monopolization.

Chris’s Substack