
“I think that revivals and the creation of fan fiction, they can both mark copyright issues, in that fan fiction often has to take from expressive elements, a copyrighted character, a character that is famous enough to be protected by copyright law or plot stories or sequels,” Smith said. “If these things are copied and reproduced, that output could be potentially infringing.”
But this is still a gray area. Given the blog, Smith said, “I would be concerned,” but “I wouldn’t say it’s automatically a violation.”
Smith told Ars that, in pulling the blog, Microsoft was “probably smart,” because courts have so far generally ruled only that AI training on copyrighted books is fair use. Courts are still weighing questions about pirated AI training materials.
On the deleted Kaggle dataset page, Mendola previously explained that for the data source, he “downloaded ebooks and then converted them to txt files.”
Microsoft may have infringed copyright
If Microsoft ever faced the question of whether the company knowingly used pirated books to train example models, fair use “could be a tough argument,” Smith said.
Hacker News commenters suggested that the blog could be considered fair use, as the training guide was for “educational purposes,” and Smith said that Microsoft could raise some “good arguments” in its defense.
However, he also suggested that Microsoft could bear some liability for contributing to the violation after leaving the blog up for a year. Before being deleted, the Kaggle dataset was downloaded more than 10,000 times.
“The end result is to create something infringing by saying, ‘Hey, you go, grab that infringing content and use it in our system,’” Smith said. “They could potentially have some kind of secondary contributory liability for violating copyright, downloading it, and then using it to encourage others to use it for training purposes.”