Certain ChatGPT Prompts Can Generate Sensitive, Copyrighted Data

OpenAI has always been a closed book when asked where its training data is scraped from. A group of researchers ran tests on ChatGPT and found a trick that makes it reveal sensitive data, as well as snippets of copyrighted works.

A Dangerous Discovery

A group of researchers from Google DeepMind, the University of Washington, Cornell, Carnegie Mellon University, the University of California, Berkeley, and ETH Zurich tested how the chatbot responds to certain prompts and found what looks to be an opportunity for threat actors.

ChatGPT was asked to repeat certain words without stopping, and even though it complied at first, it eventually started to generate other text as well. The concerning part is that this text included sensitive data about real people and companies.

The researchers asked the chatbot to repeat the word "poem" forever, which eventually led to it revealing the phone number and email address of an undisclosed founder and CEO. The group continued to do this to determine the extent of the newfound bug.
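
To make the behavior concrete, here is a minimal sketch of how one might find the point where a response stops repeating the requested word and starts producing other text. The function name find_divergence and the sample response are hypothetical illustrations, not the researchers' code or real model output.

```python
import re

def find_divergence(response: str, word: str):
    """Return (repeat_count, tail): how many times `word` is repeated at the
    start of `response`, and whatever text follows the repetition."""
    target = word.lower()
    count = 0
    pos = 0
    # Walk through whitespace-separated tokens, ignoring case and
    # surrounding punctuation, until a token is not the target word.
    for match in re.finditer(r"\S+", response):
        token = match.group().strip(".,;:!?\"'").lower()
        if token != target:
            break
        count += 1
        pos = match.end()
    return count, response[pos:].strip()

# Hypothetical response: placeholder text, not real model output.
sample = "poem poem poem, poem Contact: Jane Doe, jane@example.com"
count, tail = find_divergence(sample, "poem")
print(count)  # number of leading repetitions
print(tail)   # text emitted after the repetition stopped
```

Anything returned in the tail is the part of the response worth inspecting, since that is where the chatbot has drifted away from the instruction.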

Next they used the word "company," and the same process led the chatbot to generate the email address and phone number of a law firm in the US. The researchers repeated the approach with other words and extracted a variety of data about people.

This included snippets from websites and copyrighted research papers, names, birthdays, email addresses, social media usernames, explicit content from dating sites, fax numbers, and even Bitcoin addresses, as reported by Engadget.

For just $200, the group got hold of 10,000 examples of personally identifiable data, or about two cents per example. For fraudsters, this can be a small price to pay, since they can squeeze out significantly more money through phishing attacks or by selling the data to other threat actors.

Across the experiment, the researchers found that 16.9% of the content generated from these prompts contained sensitive data. "It's wild to us that our attack works and should've, would've, could've been found earlier," they said.
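
A share like this can be estimated by flagging each generated output and dividing by the total. The sketch below assumes two deliberately simple regular expressions for email addresses and US-style phone numbers. The patterns, the function names, and the sample outputs are all illustrative assumptions, and the article does not describe how the researchers actually classified their results.

```python
import re

# Simplistic, illustrative patterns (assumptions, not the study's method).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")

def looks_sensitive(text: str) -> bool:
    """True if the text contains something shaped like an email or phone number."""
    return bool(EMAIL_RE.search(text) or PHONE_RE.search(text))

def sensitive_share(outputs) -> float:
    """Fraction of outputs flagged by looks_sensitive (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    flagged = sum(1 for text in outputs if looks_sensitive(text))
    return flagged / len(outputs)

# Hypothetical outputs: placeholder text, not real model output.
outputs = [
    "poem poem poem",
    "Reach me at jane@example.com",
    "Call 555-010-0199 today",
    "company company",
]
print(f"{sensitive_share(outputs):.1%}")  # prints 50.0%
```

Pattern matching of this kind only catches data with a recognizable shape, so it would miss names, birthdays, and copyrighted passages and should be read as a rough lower bound.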

How Is That Happening?

One possible explanation is that OpenAI sources its training data from publicly available data, although the company has denied this. That has not stopped others from suing the AI company, accusing it of doing exactly what it denies.

A class-action suit filed in late June states that ChatGPT was trained using "massive amounts of personal data" that were stolen by OpenAI. The lawsuit claims that OpenAI did so by crawling the web, including social media sites.

The scraped data included private information and private conversations, medical data, and information about children. All of it was allegedly taken without the consent of the data's owners, as reported by Business Insider.

On the strength of these allegations, the lawsuit seeks a temporary freeze on OpenAI's commercial development, and on access to its products, until the company implements better regulations and safeguards.

This continues to be a growing issue, and not just for OpenAI. Several other AI companies have been hit with lawsuits accusing them of the same offense. The companies, of course, continue to deny the allegations.