Image by Marisa Sias | Pixabay

Harvard University has released a dataset comprising nearly one million public-domain books scanned during the Google Books project. Backed by funding from Microsoft and OpenAI, the dataset is a part of the newly established Institutional Data Initiative (IDI). It represents a significant resource for developers of large language models and other AI tools.

The IDI dataset is about five times the size of Books3, an earlier dataset used to train AI models. It spans an extraordinary range of genres, languages, and eras, featuring works by literary giants like Shakespeare and Dante alongside niche materials such as Czech mathematics textbooks and Welsh dictionaries, reports Wired.

Greg Leppert, executive director of IDI, emphasised that the project provides access to the kind of highly curated and refined content that has traditionally been the domain of major tech firms. The dataset, he added, has undergone a rigorous review process to ensure quality and usability.

Leppert envisions the dataset as a cornerstone for AI development, akin to Linux’s role in global software ecosystems. While companies will still need to incorporate additional training data for competitive differentiation, the dataset offers a robust starting point for smaller players in the AI industry and individual researchers.

Microsoft has been a key supporter of the project, aligning it with its broader philosophy of creating open data ecosystems to foster innovation. While noting the importance of accessible data pools for startups, the company clarified that this does not signal a shift away from proprietary data in its own AI training efforts.

*The release of open-source AI training datasets challenges the notion that scraping copyrighted materials is the only way forward. | Photo: Tada Images / Shutterstock.com*

The project comes at a critical juncture, as legal battles over the use of copyrighted materials in AI training rage on. The outcome of these cases could reshape the AI landscape, with companies potentially forced to rely more heavily on public-domain resources like the Harvard dataset.

Beyond books, the IDI has ambitious plans to scan millions of public-domain newspaper articles in collaboration with the Boston Public Library. The initiative is open to forming similar partnerships to expand its offerings of accessible, high-quality datasets.

However, questions remain about the dataset’s distribution. While Harvard is optimistic about partnering with Google for public release, no agreement has been finalised.

AI ethics researchers believe these open-source datasets pose a direct challenge to the notion that scraping copyrighted works is necessary for building AI models. OpenAI has argued that copyrighted materials are indispensable for products like ChatGPT, but projects like IDI undermine this defence.

Researchers also caution, however, that if this dataset is mixed with unlicensed work, the benefits will largely accrue to AI companies rather than creators.
