Photo: Koshiro K / Shutterstock.com
In its quest for training data, OpenAI developed the Whisper audio transcription model and used it to transcribe over a million hours of YouTube videos to train its GPT-4 language model. The turn to YouTube for data scraping stems from projections that the available supply of training data could be exhausted by 2026.
While this approach raised legal questions, OpenAI considered it within the bounds of fair use. The pursuit of data nonetheless presents legal and ethical challenges as companies look for new ways to sustain AI development, reported The Verge.
OpenAI president Greg Brockman personally oversaw the data crawling process due to its high importance to the company.
Over the past 18 months, demand for online data to develop artificial intelligence (AI) has surged, prompting major tech players like Meta, Google, and OpenAI to compete for access to valuable data sources.
An OpenAI spokesperson emphasised the company’s efforts to curate unique datasets tailored to each model, drawing on publicly available data and partnerships for non-public data. The company is also exploring the generation of synthetic data to augment its resources.
As per The New York Times, synthetic data is information that is generated by AI tools rather than drawn from the real world. Such data can be full of errors, which makes relying on it risky.
Google, another key player in the AI domain, acknowledged reports of OpenAI’s activities related to YouTube content but reiterated its policies prohibiting unauthorised scraping or downloading of such data. While Google may take “technical and legal measures” to stop unauthorised data scraping, some sources say that the company has itself been gathering transcripts from some YouTube content under its agreements with creators.
Meta, the third player, encountered similar challenges in sourcing high-quality data. Discussions within Meta’s AI team revealed that the company considered using copyrighted works without permission as it sought to catch up with competitors like OpenAI. The company also explored options such as purchasing book licences or acquiring a publisher, reflecting the industry’s desperation for robust training data.
Meta is also working on an open-source artificial general intelligence (AGI) model.
The broader AI community is grappling with the looming scarcity of quality training data, with projections suggesting a shortfall by 2028. Proposed solutions include training models on synthetic data or adopting curriculum learning approaches, though these methods are not without their uncertainties.
As AI companies continue to innovate, the debate over user privacy, copyright, intellectual property rights, and other ethical issues surrounding AI development is likely to become more prominent. Technological advancements must be balanced with ethical and legal considerations so that no one feels cheated.