GPTBot web crawler will collect publicly available data from the world wide web to improve future ChatGPT models
Artificial intelligence firm OpenAI has introduced “GPTBot” — a new web crawling tool that could be used to improve future ChatGPT models.
Web crawlers are bots that index the content of various websites across the internet. They are often used by search engines to display relevant content in search results.
GPTBot is designed to collect publicly available data from the internet, excluding sources that sit behind paywalls, gather personally identifiable information, or contain text that violates the firm’s policies.
In addition, website owners can block the crawler from accessing their sites by adding a “disallow” rule for it to the standard robots.txt file on their server.
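Per OpenAI’s published guidance, the crawler identifies itself with the user-agent token “GPTBot,” so the opt-out uses the standard Robots Exclusion Protocol. A minimal robots.txt sketch (the directory paths are illustrative, not from the source):

```
# Block GPTBot from the entire site
User-agent: GPTBot
Disallow: /

# Alternatively, permit only selected sections (example paths)
# User-agent: GPTBot
# Allow: /public-articles/
# Disallow: /members-area/
```

Because robots.txt compliance is voluntary on the crawler’s part, this relies on OpenAI honoring the directive, which the company has said GPTBot does.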
The scraped data will be used to train future generative AI models, such as the anticipated GPT-5, or to enhance the capabilities of OpenAI’s existing tools.
At the same time, the launch of new versions of the popular ChatGPT tool faces legal obstacles. The US Federal Trade Commission (FTC) is currently investigating OpenAI’s personal-data handling practices, the likelihood that the tool provides inaccurate answers to requests, and the risks of harm to consumers, including reputational damage.
Earlier, in June, a class-action lawsuit was filed against OpenAI in California, alleging that the firm scraped private user information from the internet to train its artificial intelligence (AI) chatbot, ChatGPT.
Other regulators across the globe, such as the Dutch Data Protection Authority, have also requested clarification on how the company processes consumers’ personal data when training its underlying system.
Facing the new challenges presented by generative AI, legislators worldwide are being urged to create a regulatory framework for overseeing the operation of ChatGPT-like chatbots and their handling of sensitive data.