AI training dataset used by tech giants allegedly created by scraping YouTube videos in violation of terms

Non-profit AI research group EleutherAI scraped YouTube subtitles to create a dataset in violation of YouTube’s terms of service, ProofNews said on July 16.

The dataset, called the Pile, allegedly includes subtitles of 173,536 YouTube videos from over 48,000 channels. About 12,000 deleted videos are part of the dataset.

Several top tech and AI firms, including Anthropic, have since used the Pile for training. Anthropic spokesperson Jennifer Martinez said the dataset includes “a very small subset of YouTube subtitles” but declined to comment on possible violations of YouTube’s terms of service.

Business software firm Salesforce also used the dataset. Salesforce VP of AI research Caiming Xiong said the dataset was “publicly available” and that Salesforce used it for academic and research purposes. ProofNews said Salesforce eventually released the same dataset publicly.

Apple used the Pile to train OpenELM, an efficient language model for on-device AI. Nvidia, Bloomberg, and Databricks also used the Pile for AI training.

ProofNews said its list of companies that used the dataset is not comprehensive, as companies do not always disclose which datasets they use in AI training.

Dataset contains crypto channels, more

ProofNews’ search tool indicates that the Pile includes videos from crypto channels and creators, including Coinbase, Cointelegraph, Bitcoin Magazine, BitBoy Crypto, 99Bitcoins, Ivan On Tech, and Andreas Antonopoulos.

ProofNews highlighted that the dataset includes transcripts from major news channels, education channels, late-night shows, popular YouTube hosts, and other categories. The Pile dataset extends beyond YouTube to other websites and online content.

ProofNews noted an earlier report from the New York Times, which said OpenAI and Google had previously harvested YouTube text. Google, which owns YouTube, said the action was permissible due to its agreement with users. OpenAI did not confirm or deny the report.

AI copyright disputes are far-reaching. Law firm BakerHostetler lists at least fifteen lawsuits involving tech firms such as Anthropic, Meta, GitHub, Stability AI, Nvidia, and Google. OpenAI faces high-profile lawsuits from Mother Jones’ parent company and The New York Times.

The post AI training dataset used by tech giants allegedly created by scraping YouTube videos in violation of terms appeared first on CryptoSlate.

