Former OpenAI Researcher Highlights How OpenAI Violated Copyright Law in Training ChatGPT

Photo Credit: Andrew Neel

OpenAI has been fraught with leadership changes as executives pour out of the company like water through a sieve. The latest departure is a former researcher who says the company broke copyright law and is damaging the internet. Here's the latest.

The New York Times reports on Suchir Balaji's departure from OpenAI after four years as an artificial intelligence researcher at the company. He was instrumental in helping OpenAI hoover up enormous amounts of data, scraping the web for knowledge to build out its large language models (LLMs).

Balaji told The New York Times that while working for OpenAI, he did not consider whether the company had a legal right to build its products by scraping data from other sources. He assumed any data published freely on the internet was up for grabs, whether it was copyrighted or not. So pirate sites that archive copyrighted books, paywalled news sites, and even Reddit posts were fair game for the massive data machine.

Balaji says that in 2022 he began to think harder about the company's approach to data collection and concluded that the way OpenAI gathered data violated copyright law, and that technology like ChatGPT was damaging the internet as a whole. In August 2024, Balaji left the company because he believed OpenAI would cause more harm than societal benefit.

"If you believe what I believe, you have to just leave the company," Balaji told The New York Times. Balaji joined OpenAI in 2020 at just 25 years old, drawn to AI's potential for problems like finding cures for diseases and stopping aging. Instead, he says, he found himself at the helm of a technology that is "destroying the commercial viability of the individuals, businesses, and internet services that created the digital data used to train AI systems."

Earlier this week, Balaji published an essay on his website detailing his concerns about the future of OpenAI. He believes the way AI companies gather data does not fall within the "fair use" that companies like OpenAI and Anthropic are arguing, and says regulation of AI is the only way out of this mess.

"While generative models rarely produce outputs that are substantially similar to any of their training inputs, the process of training a generative model involves making copies of copyrighted data," Balaji writes. "If these copies are unauthorized, this could potentially be considered copyright infringement, depending on whether or not the specific use of the model qualifies as 'fair use.'"

"Because 'fair use' is determined on a case-by-case basis, no broad statement can be made about when generative AI qualifies for fair use." Balaji points to traffic drops at major sites like Stack Overflow as a sign of how generative AI could hollow out the internet: new users ask their questions of AI models rather than the human help resources those models were trained on. While OpenAI has arranged licensing agreements with several newspapers, it still faces lawsuits from authors who say they did not consent to an LLM being trained on their copyrighted works.
