Who Owns the Data AI Is Trained On?


There are many concerns about the rapid development of generative AI, but Reddit’s IPO landing on the FTC’s radar has brought the debate over who owns the data that trains AI back to the forefront.

To say that Reddit’s IPO is causing a stir is an understatement. As of this writing, the IPO is four to five times oversubscribed, meaning demand far exceeds the 22 million shares it will likely issue.

Part of that excitement is due to generative AI. Google and other AI companies are itching to get their hands on Reddit’s vast pool of user-generated content to train their models, and the company believes selling it can bring in $203 million over the next few years.

But on March 14th, Reddit received a letter from the FTC raising questions, in a non-public inquiry, about its plans to sell, license, and share all that UGC with third parties looking to train their AI models.

In a regulatory filing, Reddit said the inquiry is a real bummer: “Regulatory engagements can be lengthy and unpredictable. Any regulatory engagement may cause us to incur substantial costs, and it is possible for any regulatory engagement to result in reputational harm or fines, cause us to discontinue or modify our products, services, features, or functionalities, require us to change our policies or practices, divert management and other resources from our business, or otherwise adversely impact our business, results of operations, financial condition, and prospects.”

Some Reddit users are unhappy with the plan to sell their UGC and are worried about a loss of privacy. Others aren’t keen to miss out on an opportunity to earn money from their years of posting.

AI models need tons of data to be trained accurately, and Reddit’s 17 billion posts written in natural language fit the bill. As a platform and not a publisher, does Reddit have the right to sell that content?

A Different Category Than Publishers

The publishing industry has been fighting this battle since OpenAI released ChatGPT in November 2022. 

OpenAI has been making deals with publishers to license their content for training purposes, paying up to $5 million a year. Big publishers, such as Axel Springer, have signed on.

For some in the publishing industry, generative AI is a reality they need to live with, and licensing fees may be a way to get back some of the revenue they’ll lose due to Google SGE, which some call an extinction-level event for media.

At AdMonsters’ Publisher Forum, keynote speaker Burhan Hamid, CTO at Time, shared that licensing fees may provide a revenue stream for publishers. However, he sees other ways for publishers to seize the opportunities of AI before their competitors beat them to it.

Still, publishers pay their journalists and writers for the content they publish, which means they own it. If they’re okay with licensing it to AI companies for training, at least everyone in the equation gets paid for their efforts. This is not the case with Reddit, Meta, and many other examples.

Training AI on People’s Data Isn’t New

In January 2020, the New York Times ran a story on Clearview AI, a company that “devised a groundbreaking facial recognition app. You take a picture of a person, upload it and get to see public photos of that person, along with links to where those photos appeared.” 

According to the Times, Clearview began with a database of 3 billion photos, all of which it scraped from Facebook, YouTube, and Venmo alongside “a million other sites.” (Today, Clearview claims to have an astounding 30 billion photos training its AI.)

In January 2019, IBM announced “a new large and diverse dataset called Diversity in Faces (DiF) to advance the study of fairness and accuracy in facial recognition technology.” The dataset relied on “publicly available images from the YFCC-100M Creative Commons data set,” which the company annotated using “10 well-established and independent coding schemes from the scientific literature.” 

So what is the YFCC-100M Creative Commons dataset, exactly? It turns out it’s 99.2 million photos and 0.8 million videos that Flickr users uploaded over a 10-year period, which Yahoo pilfered from the platform (the YF in the name stands for Yahoo Flickr). Yahoo snagged more than just photos, of course. Metadata in the dataset includes “title, description, tags, geo-tag, uploader information, capture device information, URL to the original item.” Does that metadata include PII if, say, Flickr users title their photos with their names?

More recently, Meta announced Meta AI, a chatbot and image generator they trained on posts and images shared by Facebook and Instagram users. Those users may not like that Meta used images of their nieces or grandbabies to train an AI image generator, but that horse has already left the barn.

The Market to the Rescue?

Congress has made it clear that it believes publishers should be compensated when AI is trained on their content, and the courts may back this up if the New York Times prevails in its suit against OpenAI. Privacy laws, along with the Biden Administration’s AI Bill of Rights, will help protect everyday citizens’ privacy, but nothing addresses who owns the IP in our content.

But there may be one saving grace: the market may reject AI trained on UGC. We’re already seeing healthy skepticism toward AI-generated content. Jathan Sadowski has coined the term Habsburg AI, which he defines as “a system that is so heavily trained on the outputs of other generative AI’s that it becomes an inbred mutant, likely with exaggerated, grotesque features.”

Plenty of grotesque AI-generated content and art gets passed on to consumers. Last summer, the New York Times reported that truly awful AI-generated travel guides were flooding Amazon. The Atlantic reported that AI-generated “junk” is being passed off as unique art on Etsy. AI companies may eventually conclude that training their models on the Facebook rants of Aunt Betsy and Uncle Joey results in garbage and stop buying that data.

One thing is clear, however: Answering who owns the training data and how those owners should be compensated is a thorny issue that will take years to resolve. In the meantime, everyone should assume that AI models are using all content. Period.