I know we can’t do this with any copyrighted materials. But a lot of books, music, art, and knowledge is available under Creative Commons licenses. Is it possible to create one massive torrent that includes everything that can be legally included, and then have people download only what they actually want to enjoy?
So per your suggestion, using for example the ZLibrary book/paper repo and OpenAI's training sets as a starting point, one could maybe skip the brunt of the work.
ZLibrary isn’t something that pays attention to licensing. It’s mainly copyrighted and pirated material.
I meant something like the Wikipedia dump, Project Gutenberg, and whatever archive.org has available that is tagged with a favorable license.
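A rough sketch of what "tagged with a favorable license" could mean in practice, assuming each item in some catalog carries a license URL. The `manifest` layout, the field names, and the allow-list are all made up for illustration, and which licenses count as favorable is a judgment call, not something settled here.

```python
# Hypothetical allow-list: public domain / CC0, CC BY, and CC BY-SA.
# This is an assumption for illustration; pick your own policy.
ALLOWED_PREFIXES = (
    "https://creativecommons.org/publicdomain/",
    "https://creativecommons.org/licenses/by/",
    "https://creativecommons.org/licenses/by-sa/",
)


def normalize(url):
    """Lowercase the URL and treat http:// the same as https://."""
    url = url.strip().lower()
    if url.startswith("http://"):
        url = "https://" + url[len("http://"):]
    return url


def is_freely_licensed(license_url):
    """True if the license URL starts with one of the allowed prefixes."""
    if not license_url:
        return False  # untagged items are skipped, not assumed free
    return normalize(license_url).startswith(ALLOWED_PREFIXES)


def select_items(manifest):
    """Keep only the items whose license tag passes the allow-list."""
    return [item for item in manifest if is_freely_licensed(item.get("license"))]


# Made-up example entries, not real catalog records.
manifest = [
    {"title": "Item A", "license": "http://creativecommons.org/licenses/by/4.0/"},
    {"title": "Item B", "license": "https://creativecommons.org/licenses/by-nc-nd/4.0/"},
    {"title": "Item C", "license": None},
    {"title": "Item D", "license": "https://creativecommons.org/publicdomain/zero/1.0/"},
]

selected = [item["title"] for item in select_items(manifest)]
```

Anything without a license tag gets dropped rather than guessed at, which seems like the safer default for a collection that is meant to stay legal.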
I think there are datasets compiled from sources like those. I’m not an expert on this, but I mean something like RedPajama, just without the random web scraping.
https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research