AI Companies Collecting Data Despite Regulations

Metaverse Planet | June 24, 2024 | Last Updated: January 22, 2026


The rise of artificial intelligence (AI) has ignited a data scramble. To develop their tools, AI companies require vast amounts of information, and the internet naturally becomes a prime target. However, not all online content is fair game for AI training. Websites use a file called “robots.txt” to tell crawlers which parts of a site they may and may not access.
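To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard urllib.robotparser module. The rules, the crawler names "ExampleAIBot" and "OtherBot", and the example.com URLs are all hypothetical and chosen only for illustration; they are not taken from any real site or company.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration only).
# It blocks one named crawler entirely and keeps /private/ off limits to all.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


PARSER = build_parser(ROBOTS_TXT)


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit this crawler to fetch the URL."""
    return PARSER.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleAIBot", "https://example.com/articles/1"),
        ("OtherBot", "https://example.com/articles/1"),
        ("OtherBot", "https://example.com/private/data"),
    ]:
        print(agent, url, is_allowed(agent, url))
```

Note that the check runs entirely inside the crawler. Nothing on the website's side stops a client that never performs it, which is why the file works only as a request.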

According to a Reuters report, many AI developers are choosing to ignore these digital “No Entry” signs and scrape data from restricted areas. Perplexity, a self-proclaimed “free AI search engine,” has drawn particular criticism for this practice, but it is far from alone.

OpenAI, Anthropic…

A recent report raises concerns about data collection practices in the AI industry. While the report avoids naming specific companies, sources allege that prominent players like OpenAI and Anthropic are bypassing robots.txt files to access website content. Perplexity has also been linked to servers that disregard these digital boundaries.

Perplexity CEO Aravind Srinivas previously claimed the company wouldn’t “deliberately bypass the protocol.” However, the ongoing trend suggests a need for stricter data access guidelines.

The current robots.txt protocol, established in the 1990s, relies on voluntary compliance and lacks legal enforcement power. Developing a more rigorous and detailed framework could be a crucial step towards resolving this data access conflict.