The Atlantic created a searchable database of the music used to train AI

Atlantic reporter Alex Reisner recently revealed four datasets of music being used to train AI models and made them fully searchable by the public. Two of the sets are huge, at 12 million and 9 million tracks. The other two are much smaller, but still represent a significant amount of training data, at over 100,000 songs each.

According to Reisner, the sets have been downloaded thousands of times. Although it is impossible to know everyone who has used them, both Google and Stability have confirmed in research papers that they did. Some sources, such as the Free Music Archive dataset, are free to stream for personal use but require a license for commercial applications.

While the datasets are, in theory, freely available on the internet, using them as training data is not as simple as downloading a zip file and feeding it into an AI model. As Reisner explains:

The three datasets I found are distributed as lists of links to songs on YouTube or Spotify. AI developers download real audio using tools that automate the work, some of which allow developers to bypass logins, ads, and mechanisms that could earn money or subscribers for creators. Such tools violate the terms of service of these platforms.


