OpenAI, Google Reportedly Used YouTube Video Transcriptions as Training Data for AI

Ever since AI companies began introducing their own large language models (LLMs), questions about where they get their training data have emerged just as quickly, with several companies accused of using copyrighted data without the creators' consent. Now OpenAI and Google might find themselves facing another infringement case, this time over video transcriptions.

OpenAI, Google Using YouTube Video Transcriptions

LLMs need to be trained on massive datasets so that they can process and understand complex prompts from their users. In general, the more data a model is trained on, the more accurate and informative its generated content tends to be.

However, there's the issue of using copyrighted works to make AI models smarter. Although YouTube is owned by Google, using the transcriptions could still infringe creators' copyrights, as the transcriptions are based on the dialogue or content the creators came up with.

As reported by Engadget, OpenAI used its Whisper speech recognition tool to transcribe over a million hours of YouTube videos, and the transcriptions were then used to train its latest AI model, GPT-4. OpenAI had previously been accused of using YouTube videos and podcasts to train two AI models.

Ironically, Google itself says that "unauthorized scraping or downloading of YouTube content" is not allowed. The company's spokesperson, Matt Bryant, said that the company was unaware that OpenAI was scraping data from the streaming site.

Google allegedly looked the other way as OpenAI used YouTube content to train AI models because the search engine giant was scraping data as well. In its defense, Google said that it was only using videos from creators who consented to their content being used.

In July 2023, Google changed its privacy policy, which covers publicly available content, including content found in Google Docs and Google Sheets. The policy details that such content cannot be used to train AI models unless users opt into Google's experimental feature tests.

OpenAI vs The New York Times

This isn't the first time that OpenAI's use of copyrighted data has been brought up. The New York Times has also accused OpenAI of infringing on its copyrights by using millions of the newspaper's articles to train its AI.

OpenAI argued that using copyrighted works to train its AI technology is considered fair use under the law, and noted that it collaborates with news organizations such as The Associated Press and has formed partnerships with others.

The company stated that they "look forward to continued collaboration with news organizations, helping elevate their ability to produce quality journalism by realizing the transformative potential of A.I.," as reported by The New York Times.

However, The New York Times still filed a lawsuit against OpenAI. The AI company said that the newspaper was not telling the full story and that the lawsuit was "without merit."

The issue lies with OpenAI's generative AI tools producing output that reproduces copyrighted content verbatim. However, the company said that this was a "rare bug" and that some researchers were intentionally manipulating its models to produce that outcome.

© 2024 iTech Post All rights reserved. Do not reproduce without permission.
