Amazon investigates AI start-up Perplexity over scraping allegations
What's the story
Amazon Web Services (AWS), the cloud division of Amazon, has launched an investigation into AI search start-up Perplexity, according to Wired.
The probe is centered on allegations that Perplexity violated AWS rules by scraping content from websites that explicitly blocked such actions.
An anonymous AWS spokesperson confirmed the ongoing investigation into the start-up, which is backed by Jeff Bezos's family fund and NVIDIA.
Alleged violations
Perplexity accused of violating Robots Exclusion Protocol
Perplexity is suspected of using content from websites that have denied access via the Robots Exclusion Protocol, a common web standard.
Under this protocol, a site places a plaintext file (robots.txt) on its domain to indicate which pages automated bots and crawlers should not access.
"AWS's terms of service prohibit customers from using our services for any illegal activity, and our customers are responsible for complying with our terms and all applicable laws," the AWS spokesperson stated.
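To illustrate how the Robots Exclusion Protocol works in practice, here is a minimal sketch of how a well-behaved crawler might check a robots.txt file before fetching a page. It uses Python's standard urllib.robotparser module. The robots.txt contents, the example.com URLs, and the "OtherBot" user agent are hypothetical and chosen only for illustration; they are not taken from any publisher's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one named crawler everywhere,
# and blocks all other crawlers from /private/ only.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# The named crawler is denied everywhere.
print(is_allowed(parser, "PerplexityBot", "https://example.com/article"))  # False
# Other crawlers fall under the wildcard rules.
print(is_allowed(parser, "OtherBot", "https://example.com/article"))       # True
print(is_allowed(parser, "OtherBot", "https://example.com/private/page"))  # False
```

The protocol is advisory: nothing in the file technically prevents a crawler from fetching a disallowed page, which is why compliance depends on the crawler performing a check like this one.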
Scrutiny begins
Practices under scrutiny after recent report
Perplexity's practices were questioned after a June 11 report from Forbes accused the start-up of stealing at least one of its articles.
Wired's own investigation confirmed this practice and found further evidence of scraping abuse and plagiarism linked to Perplexity's AI-powered search chatbot.
Engineers for Conde Nast, Wired's parent company, have blocked Perplexity's crawler across its websites using a robots.txt file.
IP discovery
Secret IP address linked to web scraping
Wired discovered that Perplexity had access to a server using an unpublished IP address, 44.221.181.252, which visited Conde Nast properties hundreds of times in the past three months, seemingly to scrape them.
This IP address was traced back to an Elastic Compute Cloud (EC2) instance hosted on AWS.
Perplexity CEO Aravind Srinivas stated that the secret IP address observed scraping Conde Nast websites was operated by a third-party firm performing web crawling and indexing services.
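The report does not say how the IP address was traced to EC2. One common approach, sketched here as an assumption rather than as Wired's method, is to compare the address against the IP ranges AWS publishes as a JSON file (ip-ranges.json). The sample data below mimics that file's structure, but the specific prefixes and regions are illustrative stand-ins and should not be read as the current published ranges.

```python
import ipaddress

# Illustrative sample in the shape of AWS's published ip-ranges.json.
# These entries are assumptions for the sketch, not authoritative data.
SAMPLE_RANGES = {
    "prefixes": [
        {"ip_prefix": "44.192.0.0/11", "region": "us-east-1", "service": "EC2"},
        {"ip_prefix": "203.0.113.0/24", "region": "example-region", "service": "EC2"},
    ]
}


def matching_prefixes(ip: str, ranges: dict, service: str = "EC2") -> list:
    """Return the entries for `service` whose CIDR block contains `ip`."""
    address = ipaddress.ip_address(ip)
    return [
        entry
        for entry in ranges["prefixes"]
        if entry["service"] == service
        and address in ipaddress.ip_network(entry["ip_prefix"])
    ]


matches = matching_prefixes("44.221.181.252", SAMPLE_RANGES)
for entry in matches:
    print(entry["ip_prefix"], entry["region"], entry["service"])
```

A match of this kind only shows that an address belongs to a cloud provider's range. It does not identify which customer operates the server, which is consistent with Srinivas's statement that a third party ran the crawler.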
Company response
Perplexity responds to Amazon's investigation
Perplexity spokesperson Sara Platnick stated that the company had responded to Amazon's inquiries and characterized the investigation as standard procedure.
She confirmed that Perplexity made no changes to its operations in response to Amazon's concerns.
"Our PerplexityBot—which runs on AWS—respects robots.txt, and we confirmed that Perplexity-controlled services are not crawling in any way that violates AWS Terms of Service," Platnick said.
However, she admitted that PerplexityBot ignores robots.txt when a user enters a specific URL in their prompt.
Industry reaction
Digital content industry reacts to alleged actions
Jason Kint, CEO of Digital Content Next, a trade association for the digital content industry whose members include The New York Times, The Washington Post, and Conde Nast, expressed concern over the allegations against Perplexity.
"By default, AI companies should assume they have no right to take and reuse publishers' content without permission," Kint said.
His statement implies that, if the allegations are true, Perplexity is violating many of the association's principles.