LTP 119: Rethinking AI Training

Panel

Bart Busschots (host) – @bbusschots – Flickr

In this solo show Bart shares his new thinking on the ethical and legal questions around AI training — since chatting with Antonio about AI a few months ago Bart has realised he was at least half-wrong in his original take!

While this podcast is free for you to enjoy, it’s not free for Bart to create. Please consider supporting the show by becoming a patron on Patreon.

Reminder – you can submit questions for future Q & A shows at http://lets-talk.ie/photoq

Notes

When myself and Antonio did our two-part crossover episode on AI and photography (LTP 112 & LTP 113) we joked that this was very unlikely to be the last we’d have to say on AI. It’s such a big topic, there are so many unknowns, and it’s evolving fast!

In this episode I want to focus in on just one aspect of AI’s intersection with photography: how the AIs get trained. There will undoubtedly be many more episodes in future where I dive into the myriad other complex questions about what we can, should, and shouldn’t do with the AIs once they are trained, but this time, my focus is on the questions the training itself raises.

My inspiration for this episode is a recent realisation that my first take on the legalities and the ethics of AI training was at least half wrong.

I was very in favour of Adobe’s approach of training an AI on their own stock image library (Adobe Firefly), and I stand by that, but only because I expect training on a high quality curated data set should give better results. I’m sure the shareholders also like the fact that it neatly side-steps the controversies.

What I’ve changed my mind about is my initial criticism of AIs like Dall-E which are trained on the internet as a whole. I assumed that was a breach of copyright, and that it was theft. Now that I’ve had some time to think more deeply about the question, I’m pretty sure it’s not a copyright violation, and I don’t consider it theft anymore.

Firstly, let me give credit to the podcast interview that triggered me to re-evaluate my opinions — FLOSS Weekly 744: A Chill Pirate Lawyer – Damien Riehl, Open Source and Legal Rights — overcast.fm/….

Let’s start with the computer science part of this question. I’ll try to ease you into my thinking on my home turf.

When you train an AI Model, what are you actually doing? You are tweaking the values in a big grid of numbers, nothing more. These numbers are the weights for each connection in a neural network. That’s all a model is, a collection of numbers that get loaded into a neural net. Once you’ve added your weights you just shove your question in one side of the neural net, and an answer falls out the other. When you’re training an AI you’re tweaking the weights to get the neural net to be ‘better’, for your definition of better.

When a model is brand new and knows nothing it has exactly the same number of weights as it will when you finish training it on half the Internet. As counter-intuitive as it may be to non-computer scientists, a model does not grow as you train it, it just changes. You start off with a grid of random numbers, and the training tweaks the values of those numbers.
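To make the 'same grid, different values' point concrete, here is a minimal sketch in Python. It is not how any real image model is built: the 'model' is just three numbers, the training data is made up, and the helper names (make_model, predict, train) are invented for this illustration. The only thing it is meant to show is that training changes the values of the weights, not how many of them there are.

```python
import random


def make_model(n_inputs, seed=0):
    """Return a brand-new 'model': a list of random weights plus one bias."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_inputs + 1)]


def predict(weights, inputs):
    """Weighted sum of the inputs; the last weight acts as the bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + weights[-1]


def train(weights, examples, learning_rate=0.05, epochs=200):
    """Nudge every weight a little after each example (plain gradient descent)."""
    weights = list(weights)  # work on a copy so the untrained model is kept
    for _ in range(epochs):
        for inputs, target in examples:
            error = predict(weights, inputs) - target
            for i, x in enumerate(inputs):
                weights[i] -= learning_rate * error * x
            weights[-1] -= learning_rate * error
    return weights


# Made-up training data that follows the rule: target = 2*a + 3*b + 1
examples = [((a, b), 2 * a + 3 * b + 1) for a in range(3) for b in range(3)]

untrained = make_model(n_inputs=2)
trained = train(untrained, examples)

print(len(untrained), len(trained))  # the count is the same before and after
print(untrained)
print(trained)
```

None of the nine training examples are kept anywhere in the trained list. All that is left is three numbers that have drifted towards the rule the examples had in common.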

That raises the question: what is being stored in an AI model? That’s actually a deeply philosophical question, but one thing we do know for certain is that a model is not storing the content it’s being trained on. If the words or images fed to a model during training were being stored, the model would grow and grow and grow. It would basically be impossible to train an AI on even 1% of the internet! Clearly, that’s not how these things work.

So what is being stored? I think the word that fits best is ‘ideas’. A model starts off with nonsense ideas, and the more you train it, the more its ideas approximate an average of the ideas expressed in the original works that make up the training data.

Hey — those sound like legalistic words? Yup!

Let’s now shift our attention to copyright — I’m gonna use US Copyright law to illustrate the concepts, but other industrialised democracies basically do it the same way, and most of the leading AI companies are based in the US anyway.

So, what was copyright designed to do? It was designed to protect ‘creative works’, where creative works are specific expressions of ideas. Note that it protects expressions of ideas. Copyright absolutely positively does not protect ideas — that’s what patents do (for a sub-set of ideas anyway)!

Copyright is very powerful because it’s automatic — if you produce a creative work, you hold its copyright by default without having to do anything (you have to take an action to waive or transfer those rights).

To counter-balance that power copyright is also limited — you get automatic protections in exchange for others getting to make ‘fair use’ of your work. This is analogous to patents where you share your invention with the world for anyone to use in future in exchange for legally protected exclusivity for a few years at the start so you can recoup your investment and make some profit before the flood gates open.

I really want to stress the point that you can only copyright specific expressions of ideas, not the ideas themselves. In legal jargon that’s known as the Idea–expression dichotomy.

Now let’s look at this idea of fair use. Firstly, fair use can get very complicated around the edges, and it often needs to be litigated on a case-by-case basis, but there are guiding principles for that kind of litigation, and they’re generally quite sensible.

I think we all know you can include part of a copyrighted work in another work if you’re adding new creativity of your own — we call this a ‘derivative work’. It can be a review, a rebuttal, or an overview of a broader area, and it can even be satire.

Transformative works are also fair use — if you use someone else’s copyright to create something new, original, and different, then that’s just fine from a copyright POV.

For example, if you ingest every book in the world to build an index mapping the line number, page number, and ISBN of every use of the word ‘poop’ in English literature, that’s fair use because the index you build is transformative. You started with books which have a copyright, but you ended with something completely different, an index, so no copyright problem.
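For the programmers in the audience, that kind of index can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the function name build_word_index, the shape of the books structure (a placeholder identifier mapped to pages, each page a list of lines), and the two made-up 'books'. The thing to notice is that the result holds only locations, none of the original text.

```python
from collections import defaultdict
import re


def build_word_index(books, word):
    """Map each book identifier to the (page, line) positions of `word`.

    `books` is a hypothetical structure: {isbn: [page, ...]}, where each
    page is a list of text lines. Only locations are kept, never the text.
    """
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    index = defaultdict(list)
    for isbn, pages in books.items():
        for page_number, lines in enumerate(pages, start=1):
            for line_number, line in enumerate(lines, start=1):
                if pattern.search(line):
                    index[isbn].append((page_number, line_number))
    return dict(index)


# Two made-up 'books' with placeholder identifiers, purely for illustration.
books = {
    "isbn-placeholder-1": [
        ["The dog ran off.", "It left poop on the lawn."],
        ["Nothing to see on this page."],
    ],
    "isbn-placeholder-2": [
        ["A very proper novel.", "No such word appears here."],
    ],
}

index = build_word_index(books, "poop")
print(index)  # {'isbn-placeholder-1': [(1, 2)]}
```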

I didn’t pull that example out of thin air. Google did something very similar to create Google Books: they scanned every book they could get their hands on so they could build a searchable index of all books, and allow people to search this index and see the matches in context with snippets of the surrounding text. Notice Google are doing more than just creating an index, they’re also storing and showing scans of the original pages. Authors were cranky and tried to sue Google for copyright infringement, and Google countered that their index is transformative, so it’s fair use, and the courts agreed with Google!

“Google’s unauthorized digitizing of copyright-protected works, creation of a search functionality, and display of snippets from those works are non-infringing fair uses. The purpose of the copying is highly transformative, the public display of text is limited, and the revelations do not provide a significant market substitute for the protected aspects of the originals”

So, legally speaking, it seems clear to me that AI training is not a copyright infringement. AI models don’t contain copies of expressed ideas, they just store their own representation of averaged out ideas, and they do that by transforming text or images into collections of weights — there is no copyright on the ideas they capture, and their output is very transformational.

Now let’s circle back to the ethics of all this. Imagine you’re an author who’s horrified by the idea of a poop index, and you learn that your book is listed in the index, which is a little embarrassing for you. You’re probably screaming “what gives you the right to take my copyrighted work!” That’s the same sentiment as photographers demanding to know “what gives OpenAI the right to train Dall-E on my pictures”. For me this is simple, this is the quid-pro-quo at the heart of copyright: you get a heck of a lot of protection from copyright law, and you get those protections automatically, but the price you pay for those protections is fair use of your work by others.

So, big-picture, I’m just fine with AIs training on my photos, my blog posts, and my podcasts. But that doesn’t mean there are no other ethical or legal issues.

Even sticking with our narrow focus on training, there’s already an obvious legal and ethical morass — slurping/web crawling.

While I’m happy there’s no copyright problem using other people’s images to train an AI, the act of fetching those images could be problematic. Slurping data off the web consumes bandwidth, and bandwidth costs money. Generating a huge bill for someone else so you can make a profit off their photos is not ethically OK in my book; it sounds like theft to me! It’s also illegal in at least some countries. In the US the deeply flawed and utterly anachronistic Computer Fraud and Abuse Act (CFAA) makes it illegal to break a website’s Terms of Service (TOS), so if Getty add a line to their TOS saying you can’t scrape their website, then legally speaking, in the US, you can’t scrape their website! Yes, I have some very sharp criticisms of the CFAA, and I won’t bore you with them on a podcast about the art and craft of photography, but when it comes to sites using the law to protect themselves from huge bandwidth bills, it seems just to me!

Now I want to shift to stressing just how many other legal and ethical issues there are with the USE of AIs. I’ve focused exclusively on their training, but once you have a trained AI, you now have a very flexible and powerful tool that can be used and abused in all kinds of ways. Like every other technology, there will need to be legal, organisational, and technological safeguards put in place, and while some of them will be obvious and uncontroversial, some of the edge cases will be real doozies!

One point I do want to make is that for a lot of this, we don’t need new laws because when something is illegal it’s illegal, regardless of how you pull it off. Using AI to do something illegal doesn’t suddenly make it magically legal, so a lot of this is going to come down to applying existing laws and legal precedents in new situations.

We have many rights, and they remain in a post-AI world!

Just off the top of my head, here are some of the existing crimes, laws, and rights AI vendors will need to provide safeguards around:

Fraud
Theft
Counterfeiting
Our right to our good name
Our rights over our own likenesses
Our Trademarks and other protected intellectual property

I want to finish with a practical example that touches on the last of those protections I listed — trademarks.

A notorious example of what can go wrong with image-generating AI models trained on the internet broadly is the appearance of what look like Getty Images logos on generated images. That logo is a legally protected trademark, so clearly there’s a missing guardrail here!

People who don’t understand AI think this means the AI stored Getty’s original images and is copying-and-pasting bits of those images together to produce their output, and therefore it must be a copyright violation.

We know that’s not correct, but there is something important going on here. What has happened is that because Getty put their logo on so many images, AIs can end up learning that images of, say, footballers, contain circular things, people in shorts, green backgrounds, and a shape that looks like a Getty logo. This is not a copyright violation, but it is a breach of Getty’s trademark on the logo — the image is in effect a counterfeit. The logo identifies images as being part of Getty’s high quality curated catalogue, and the generated image is no such thing!

AI companies are going to need to add safeguards to prevent generative AIs breaching trademarks like this. By their very nature there are databases of trademarks, so theoretically this is actually a simple guardrail to erect. All that’s needed is a classifier AI inserted after the generative AI to filter the results and detect and block/remove anything that matches a registered trademark.
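The shape of that guardrail can be sketched in Python. This is only a toy under loud assumptions: an 'image' is just a list of the things visible in it, REGISTERED_MARKS stands in for a real trademark database, and find_trademarks, guarded_generate, and fake_generator are names invented here, not any vendor's API. A real classifier would have to work on pixels, which is the hard part this sketch skips.

```python
from typing import Callable, List, Optional

# Hypothetical stand-in for a real trademark database.
REGISTERED_MARKS = {"example-stock-logo"}

# In this sketch an "image" is just the list of things visible in it.
Image = List[str]


def find_trademarks(image: Image) -> List[str]:
    """Toy classifier: report any visible element that is a registered mark."""
    return [element for element in image if element in REGISTERED_MARKS]


def guarded_generate(
    prompt: str,
    generate: Callable[[str, int], Image],
    max_attempts: int = 3,
) -> Optional[Image]:
    """Run the generator, passing along only results the classifier clears."""
    for attempt in range(max_attempts):
        image = generate(prompt, attempt)
        if not find_trademarks(image):
            return image
    return None  # every attempt was blocked


def fake_generator(prompt: str, attempt: int) -> Image:
    """Pretend generator whose first attempt includes a logo-like shape."""
    image = ["ball", "player in shorts", "green pitch"]
    if attempt == 0:
        image.append("example-stock-logo")
    return image


result = guarded_generate("footballer", fake_generator)
print(result)  # ['ball', 'player in shorts', 'green pitch']
```

The design choice worth noting is that the filter sits entirely outside the generator: the generative model is left as it is, and the classifier only decides what gets through to the user.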

Like I said, there are obviously going to be lots of other edge cases that will cause AI companies and society at large real headaches, so none of this will be easy, but that doesn’t change the fact that my reflexive copyright-based objections to AI training were wrong!

So basically, I’m happy for AI companies to learn from my photos as long as they don’t generate a huge bill for me!

Let's Talk Apple & Let's Talk Photography
