
OpenAI may soon be forced to explain why it removed a pair of controversial datasets made from pirated books, and the stakes couldn’t be higher.
At the center of a class-action lawsuit brought by authors alleging that ChatGPT was illegally trained on their works, OpenAI's decision to remove the datasets could be a decisive factor in whether the authors win.
It is undisputed that OpenAI deleted the datasets, named "Books 1" and "Books 2," before the release of ChatGPT in 2022. Created by former OpenAI employees in 2021, the datasets were built by scraping the open web and drew large portions of their data from a shadow library called Library Genesis (LibGen).
As OpenAI tells it, the datasets fell out of use that same year, leading to an internal decision to delete them.
But the authors suspect there is more to the story. They noted that OpenAI stumbled by retracting its claim that "non-use" of the datasets was a reason for their deletion, then later asserting that all reasons for the deletion, including "non-use," should be protected by attorney-client privilege.
To the authors, it looked as if OpenAI was hastily backtracking after the court granted their discovery requests to review OpenAI's internal messages about that "non-use."
If anything, OpenAI's reversal made the authors more curious to see how OpenAI discussed "non-use" internally, and now they may learn all the reasons why OpenAI deleted the datasets.
Last week, U.S. Magistrate Judge Ona Wang ordered OpenAI to share all communications with in-house lawyers about the deletion of the datasets, as well as "all internal references to Libgen that OpenAI has redacted or withheld based on attorney-client privilege."
According to Wang, OpenAI undermined its own position by arguing that "non-use" was not a "reason" for removing the datasets while simultaneously claiming that it should be treated as a privileged "reason."