The New York Times Won’t Allow AI Companies to Scrape Its Data

As convenient as AI tools are, the companies behind them still face criticism over how they acquire the data used to train their models. AI giants like OpenAI have been accused of illegally scraping information from the internet, and publications like The New York Times are taking measures to prevent that.

The New York Times' New Terms of Service

The publication is getting ahead of any issues by revising its Terms of Service to bar other companies from using its data to train algorithms. The terms now clearly state that data from its archives cannot be used to train "any software program."

That includes, but is not limited to, training a machine learning or artificial intelligence system. The new policy covers content produced by The New York Times, such as text, photos, video, and metadata, according to Gizmodo.

The change comes just after AI companies were reported to be proposing collaborations with certain news media outlets, offering free services and partnerships in order to gain a foothold in the news market.

Even Google approached news companies like The Times and The Washington Post, claiming that it has an AI tool called "Genesis" that can help journalists with their jobs. Some would argue that this is just a way for AI companies to scrape data legally.

The Associated Press has already made a deal with OpenAI under which OpenAI gives the news agency access to its "technology and product expertise" in exchange for the Associated Press' text archives, which can then be used to train AI models.

Others Have Done It

The New York Times is not the only company or organization that's wary of the issue. Social media platforms are more at risk of having their data stolen. Among the platforms that have been vocal about AI companies illegally using their data for AI training are X and Reddit.

Elon Musk, owner of X (formerly known as Twitter), has said that AI companies have been using X data to train their AI models, which led him to impose a read limit on all users. Verified and unverified users have different daily view limits, but the limits serve the same purpose.

Reddit, on the other hand, has had a rougher transition in trying to prevent the same thing. To keep AI companies from obtaining data without its permission, the company implemented a new API policy.

Under the policy, third-party developers had to pay a fee to gain access to Reddit's API. Moderators were not too happy about the change, especially since it affected the apps they used to enhance their experience and support their roles on the social networking site.

Unable to keep up with the prices, some of the third-party apps were forced to shut down, following a protest initiated by subreddit moderators. The protest has since died down, but not before it disrupted Reddit's operations.

Regulations are still being drawn up to set limits on the use of AI, as well as on the kind of data it can use for machine learning. Given that generative AI is fairly new compared to the basic AI used in certain systems, there are still a lot of topics to deliberate.