Tumblr, WordPress data to be sold for training AI tools

By Dwaipayan Roy

Feb 28, 2024

12:54 pm

What's the story

Automattic, the company behind Tumblr and WordPress.com, is in discussions with AI firms Midjourney and OpenAI to supply training data drawn from user posts, according to 404 Media. The deals are said to be "imminent" and could offer a new revenue source for Automattic. In response, the firm plans to introduce a setting today that lets users opt out of sharing their data with third parties, including AI companies.

Worrying

Data dump from 2014-23 collected

However, 404 Media's report suggests that Automattic has already compiled an "initial data dump" of all public Tumblr posts from 2014 to 2023, including content that is not visible on public blogs. To address these concerns, Automattic released a statement on Tuesday titled "Protecting User Choice." The statement refers to partnerships with unnamed AI firms and says the company is currently blocking major AI platform crawlers by default.

Contents

What was present in the Tumblr data dump?

The Tumblr data dump comprised private posts on public blogs, unanswered asks, private answers, and posts on deleted or suspended blogs. Content from "premium partner blogs" (such as Apple's former music blog) and posts marked as "NSFW," "explicit," and "mature" were also included. However, DMs, password-protected posts, and media violating community guidelines were not present in the data dump.

Data

What about WordPress?

WordPress.com hosts blogs as a service run by Automattic. Meanwhile, the open-source WordPress CMS (WordPress.org) is used by individuals and businesses on self-hosted websites. It is unclear whether self-hosted WordPress blogs that use Automattic plugins to connect to the company's infrastructure fall under its AI-scraping agreements.

Explanation

Public content to be shared if users don't opt out

Automattic's statement also says that the company will share public content from WordPress.com and Tumblr with AI companies only if users haven't opted out. The company says it is working directly with certain AI firms that align with community values like attribution, opt-outs, and control. Many companies have struck deals to provide training data to AI tool creators, which often rely on publicly available online information.

Issue

Companies struggle to balance user satisfaction and AI experimentation

Reddit has a $60 million yearly agreement with Google, while Shutterstock has partnered with OpenAI to let it train on its image library. However, artists and writers have protested against their work being used for training purposes, leading to a backlash against platforms like DeviantArt that have experimented with this technology. The specifics of Automattic's potential deal and its financial impact remain undisclosed.