Nvidia scrapes content from YouTube and Netflix for AI training

Nvidia, a leader in AI technology, has been using a controversial approach to compile training data for its AI products: scraping videos from platforms like Netflix and YouTube. This raises significant legal and ethical concerns.

Internal Slack chats, emails, and documents obtained by 404 Media reveal that Nvidia employees were instructed to scrape video content from YouTube, Netflix, and other sources to build datasets for various AI models.

These datasets were crucial for developing Nvidia’s Omniverse 3D world generator, self-driving car systems, and digital human products.

The project, known internally as Cosmos, aims to create a comprehensive video foundation model that integrates simulations of light transport, physics, and intelligence. Despite its ambitious scope, the project has not yet been publicly announced.

When employees questioned the legality of using copyrighted content without explicit permission, project managers reassured them that the practice had been cleared at the highest levels of the company.

Nvidia maintains that its practices comply with copyright law, stating that it respects the rights of content creators and that its use of content is protected under the fair use doctrine.

“Copyright law protects particular expressions but not facts, ideas, data, or information. Anyone is free to learn facts, ideas, data, or information from another source and use it to make their own expressions,” Nvidia clarified. “Fair use also protects the ability to use a work for a transformative purpose, such as model training.”


To many experts, however, this stance is contentious. Emails and Slack messages show that Nvidia employees used tools such as yt-dlp and virtual machines to bypass content protections and download vast amounts of video data. Ming-Yu Liu, vice president of research at Nvidia, for instance, mentioned using 20 to 30 virtual machines to download 80 years’ worth of videos per day.
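
For context, yt-dlp is a widely used open-source downloader for YouTube and other video sites. The snippet below is a minimal illustrative sketch of how bulk downloading through its Python API generally works; it is not a reproduction of Nvidia's internal pipeline, and the URL and output path are placeholders.

```python
# Illustrative sketch only: a minimal bulk download using yt-dlp's Python API.
# The video URL and output template are placeholders, not anything referenced
# in the reporting on Nvidia.
from yt_dlp import YoutubeDL

options = {
    "format": "bestvideo+bestaudio/best",   # prefer the highest-quality streams
    "outtmpl": "downloads/%(id)s.%(ext)s",  # save each file under its video ID
    "ignoreerrors": True,                   # skip videos that fail instead of aborting
}

with YoutubeDL(options) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=EXAMPLE_ID"])
```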

As AI spreads into more fields, companies are hunting for additional data to refine their models. OpenAI reportedly took a similar approach, transcribing more than a million hours of YouTube videos to help train GPT-4.

The tech industry is engaged in an ongoing debate over the ethical and legal implications of using publicly accessible data to train AI models. While Nvidia asserts that its practices adhere to legal requirements, this view is not universally accepted: Google and Netflix have both stated that using their content for AI training violates their terms of service.

The Nvidia case is only the tip of the iceberg. AI models have an insatiable demand for training data, which often leads companies into a legal grey area to acquire it.

Publications like The New York Times and eight newspapers owned by Alden Global Capital have sued OpenAI for copyright infringement.

More recently, OpenAI has entered into data-licensing agreements with Reddit, News Corp, Stack Overflow, and other publishers and platforms.


Kumar Hemant

Deputy Editor at Candid.Technology. Hemant writes at the intersection of tech and culture and has a keen interest in science, social issues and international relations. You can contact him here: kumarhemant@pm.me
