Spotting AI authors: Study reveals patterns that expose machine-generated text
A recent study has found that large language models (LLMs) like ChatGPT often overuse certain words, a possible sign of their comparatively narrow vocabulary. The researchers likened this "excess word usage" in biomedical papers to the way doctors measured COVID-19's impact through "excess deaths." The study suggests that around 10% of abstracts published in 2024 were processed with LLMs.
Unprecedented effect of LLMs on scientific vocabulary
The researchers noted that the influence of LLM usage on scientific writing is "truly unprecedented and outshines even the drastic changes in vocabulary induced by the COVID-19 pandemic." To quantify that influence, they analyzed 14 million biomedical abstracts published between 2010 and 2024, measuring each word's excess usage relative to a pre-LLM baseline.
LLMs led to increased usage of certain words
The team used papers published before 2023 as a baseline for comparison with those published after LLMs became widely available. They found that less common words such as "delves" are now used 25 times more frequently, while others like "showcasing" and "underscores" saw a ninefold increase. Even common words like "potential," "findings," and "crucial" saw usage rise by up to 4%.
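The baseline-comparison idea can be sketched in a few lines of Python: compute each word's relative frequency in a recent corpus and in a pre-LLM baseline corpus, then flag words whose usage has multiplied. The toy corpora, the ratio threshold, and the pseudocount for words absent from the baseline are all invented for illustration; the study's actual statistical treatment is more involved than this.

```python
from collections import Counter

def word_frequencies(abstracts):
    """Relative frequency of each word: occurrences / total words in corpus."""
    counts = Counter()
    total = 0
    for text in abstracts:
        words = text.lower().split()
        counts.update(words)
        total += len(words)
    return {w: c / total for w, c in counts.items()}, total

def excess_usage(baseline, current, baseline_total, min_ratio=2.0):
    """Flag words whose current frequency is at least min_ratio times
    their baseline frequency (threshold chosen for illustration)."""
    flagged = {}
    for word, freq in current.items():
        # Pseudocount floor so words absent from the baseline still get a ratio.
        base = baseline.get(word, 1 / baseline_total)
        ratio = freq / base
        if ratio >= min_ratio:
            flagged[word] = ratio
    return flagged

# Toy corpora -- invented, not the study's data.
baseline_abstracts = [
    "we report findings on protein folding",
    "the study examines gene expression in mice",
]
current_abstracts = [
    "this paper delves into protein folding and delves into expression",
    "the study delves into gene regulation underscoring key findings",
]

base_freq, base_total = word_frequencies(baseline_abstracts)
cur_freq, _ = word_frequencies(current_abstracts)
flagged = excess_usage(base_freq, cur_freq, base_total)
```

On this toy data, "delves" is flagged as excess while an ordinary word like "findings" is not, mirroring the style-word pattern the study describes at a much larger scale.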
Excess word usage: A marker of AI influence
In their search for excess word usage between 2013 and 2023, the researchers identified terms related to global events such as "ebola," "coronavirus," and "lockdown." However, in 2024, the excess words were mostly style words rather than content words. Of the 280 excess style words identified that year, two-thirds were verbs and about a fifth were adjectives.
AI-processed papers more prevalent in non-English speaking countries
The researchers used these excess style words as markers of ChatGPT usage, estimating that around 15% of papers published in non-English-speaking countries like China, Taiwan, and South Korea are now AI-processed. This is higher than in English-speaking countries like the United Kingdom, where the rate is 3%. They acknowledged, however, that native English speakers might simply be better at concealing their LLM usage.