Redditors have long speculated about whether their content has been used to train LLMs like ChatGPT. Now the speculation can finally be laid to rest. In a deal that has spurred as much indignation as laughter, Reddit has announced a partnership with OpenAI. The deal allows Reddit to sell its user-generated content (UGC) to the AI company while allowing Reddit to feature unspecified AI products and services. Monetisation has been a key priority for Reddit since the lead-up to its IPO in March this year. And Reddit shareholders, who include Sam Altman, can be pleased: shares rose by as much as 15% following the announcement of the partnership in May.
OpenAI has been in hot water with quite a few companies over the purported theft of copyrighted content. Eight newspapers, including The Chicago Tribune, The New York Daily News, and The Orlando Sentinel, are currently suing OpenAI for appropriating published content as training data for its large language models. OpenAI, for its part, is defending this as fair use. Its current strategy appears to be partnering with publishers outright. Since December last year, OpenAI has been partnering with Axel Springer, a prominent German media company with publications such as Politico and Business Insider under its belt. OpenAI is also reportedly in talks with CNN, Fox, and Time for deals much like Reddit’s.
Reddit’s deal with OpenAI was first teased back in April 2023 by Reddit co-founder and CEO Steve Huffman. Reddit currently hosts over 100,000 communities, or subreddits, that link people across the world through their opinions on different topics. This has produced almost two decades’ worth of user-generated content, which will now be incorporated into ChatGPT and possibly into future AI products and services that OpenAI develops.
As mentioned, Reddit has been very open in its pursuit of monetisation since its intent to go public was first announced. In 2023, Reddit officially announced that it would start charging a fee for access to its Data API. This resulted in a mass protest by users and the shutdown of many third-party apps, both of which the company largely ignored. A deal allowing Google to train its AI models on Reddit data, valued at about $60 million a year, followed shortly after. The latest announcement regarding the deal with OpenAI was preceded by another announcement that denounced the very same issue that Redditors had raised: the use of user-generated content for commercial purposes without due recognition or recompense. The announcement read: “Unfortunately, we see more and more commercial entities using unauthorised access or misusing authorised access to collect public data in bulk. Worse, these entities perceive they have no limitation on their usage of that data, and they do so with no regard for user rights or privacy, ignoring reasonable legal, safety, and user removal requests.”
Seven days later came Reddit’s justification for monetising its content against its own principles (a contradiction it did not deign to acknowledge): “Keeping the internet open is crucial, and part of being open means Reddit content needs to be accessible to those fostering human learning…” In addition to the unspecified amount involved in the deal, going back on its word also allows Reddit to take OpenAI on board as an advertising partner. OpenAI, for its part, will benefit from something it has lacked so far: real-time data. Access to the Reddit Data API allows ChatGPT to finally learn from culturally relevant data, drawn in real time from up-to-date human conversations.
According to Reddit co-founder and CEO Steve Huffman: “Reddit has become one of the internet’s largest open archives of authentic, relevant, and always up-to-date human conversations about anything and everything. Including it in ChatGPT upholds our belief in a connected internet.”
If that sounds dystopian, that’s because it is, to some extent.
According to Deutsche Bank, this is nowhere near the end for Reddit either. Reddit is making further progress in signing new deals with ‘social-listening’ (i.e., data-gathering) companies and finance companies. By the bank’s estimates, Reddit can be expected to sign on two additional LLM partners as buyers of its data. Only time will tell how users respond when their content starts being referenced in AI-generated answers, and by several companies at that.
(Theruni Liyanage)