Revealed: pirated books by Haruki Murakami and Stephen King have become AI training data, and even the AI giants can't escape this fate
In order to train large-scale language models, companies including OpenAI, Meta, Google, and Microsoft have collected large amounts of copyrighted works from the Internet without permission, navigating the gray area of copyright law.
OpenAI is currently facing a slew of lawsuits, with plaintiffs alleging that most of the books in the company's training dataset came from pirated sources and unauthorized websites. If found liable for copyright infringement, the company could face huge fines or be forced to redesign its algorithms. This has made AI companies increasingly reluctant to share details of their AI training data.
However, some openly available pirated corpora have drawn scrutiny.
A dataset called Books3, which contains nearly 200,000 books by bestselling authors such as Haruki Murakami and Stephen King, has recently come to light. It has been used to train AI models and has come under repeated attack from anti-piracy organizations.
Copyright hangs over AI companies like a sharpened blade, and the situation is precarious.
Training data for AI models has never been fully transparent. This year, a number of American authors banded together to file a lawsuit against OpenAI, accusing it of copyright infringement and of violating several laws by using pirated books to train its language models.
The writers offered simple evidence to support their claims: they never gave OpenAI permission to use their work, yet ChatGPT was able to produce accurate summaries of it, which led them to believe their books had been taken from somewhere without authorization.
In 2020, open-source AI proponent Shawn Presser uploaded a dataset called "Books3" to the Internet.
More than 10,000 writers have banded together to urge AI companies to stop using their work without permission. These writers do not want their styles mimicked by AI unless the tech companies pay for it.
The Authors Guild has sent an open letter to the CEOs of a number of tech giants, including OpenAI, Google, Meta, Stability AI, IBM, and Microsoft, asking them either to stop using writers' work without permission or to compensate them fairly for that use.
This year, lawsuits have been filed across the United States against OpenAI, Meta, and other tech giants for allegedly using the work of thousands of authors to train large language models without their consent or authorization. These cases touch a massive industry, and more content creators are expected to take legal action.
Beyond these giants, other generative AI companies have been drawn into copyright disputes. For example, Stability AI, the company behind Stable Diffusion, has been sued over training on the LAION-5B dataset, which contains some 5.85 billion image-text pairs, most of them protected by copyright. Getty Images is suing Stability AI for allegedly training its image-generation models on more than 12 million Getty Images photos without permission.
Many artists and other stakeholders have also filed copyright infringement lawsuits against companies such as Stability AI, DeviantArt, and Midjourney. They allege that these companies violated their copyrights and likeness rights, engaged in unfair competition, and were unjustly enriched, and they are seeking damages and injunctions.
Public opinion is divided. While some people worry that training AI on such material raises copyright problems, others take a different view, arguing that AI companies like OpenAI do not need special license agreements to train models and that copyright concerns hold back the progress of AI. On the other side, some argue that obtaining authors' consent is essential, that creators should have the right to refuse, or that AI companies should at least purchase the books they use as training data.
Technology is creating things never before seen in human history. Should there be a bottom line for the spirit of open source when it comes to AI training data? Will future laws lean toward restriction or protection? How to balance the development of AI with respect for the rights of human creators may prove as important as the question of "when will artificial general intelligence arrive?"