DeGenTWeb: A First Look at LLM-dominant Websites
Sichang Steven He, Calvin Ardi, Ramesh Govindan, Harsha V. Madhyastha
TLDR
DeGenTWeb systematically identifies LLM-dominant websites, revealing their high prevalence and growth, while noting detection challenges with newer LLMs.
Key contributions
- Developed DeGenTWeb, a system for systematically identifying LLM-dominant websites.
- Adapted LLM text detectors for web pages and aggregated results for site-level categorization.
- Discovered high prevalence and growth of LLM-dominant sites in Common Crawl and Bing search.
- Highlighted the increasing difficulty of detecting LLM-generated content as models become more capable.
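The page-to-site aggregation named in the second contribution can be illustrated with a minimal sketch. This is not the paper's algorithm: the summary does not specify how DeGenTWeb combines page results, so the function `classify_site`, the type `SiteVerdict`, and all three thresholds are hypothetical, chosen only to show the general idea of requiring a strict per-page cutoff (to limit false attribution of human text) and enough pages before labeling a whole site.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SiteVerdict:
    pages_scored: int        # number of pages with a detector score
    flagged_fraction: float  # share of pages at or above the page threshold
    llm_dominant: bool       # site-level label


def classify_site(
    page_scores: Sequence[float],
    page_threshold: float = 0.9,  # hypothetical: strict cutoff per page
    site_threshold: float = 0.5,  # hypothetical: share of flagged pages needed
    min_pages: int = 5,           # hypothetical: minimum evidence per site
) -> SiteVerdict:
    """Aggregate per-page detector scores (0..1) into one site-level label."""
    n = len(page_scores)
    if n == 0:
        return SiteVerdict(0, 0.0, False)

    # Count pages the (assumed) detector scores as likely LLM-generated.
    flagged = sum(1 for score in page_scores if score >= page_threshold)
    fraction = flagged / n

    # Label the site only when there are enough pages and most are flagged.
    dominant = n >= min_pages and fraction >= site_threshold
    return SiteVerdict(n, fraction, dominant)


if __name__ == "__main__":
    print(classify_site([0.95, 0.97, 0.92, 0.99, 0.10, 0.20]))
```

With these illustrative defaults, a site with only two scored pages is never labeled LLM-dominant, however high the scores, because the sketch treats a small sample as insufficient evidence.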
Why it matters
Claims that LLM-generated content is taking over the web typically rest on unrepresentative samples and opaque methodology. This paper offers a systematic way to quantify LLM-dominant websites. Its findings indicate that such sites are already prevalent and growing, and that detecting them is getting harder, with implications for web integrity and search relevance.
Original Abstract
Many recent news reports have claimed that content generated by large language models (LLMs) is taking over the web. However, these claims are typically not based on a representative sample of the web and the methodology underlying them is often opaque. Moreover, when aiming to minimize the chances of falsely attributing human-authored content to LLMs, we find that detectors of LLM-generated text perform much worse than advertised. Consequently, we lack an understanding of the true prevalence and characteristics of LLM content on the web. We describe DeGenTWeb which systematically identifies LLM-dominant websites: sites whose content has been generated using LLMs with little human input. We show how to adapt detectors of LLM-generated text for use on web pages, and how to aggregate detection results from multiple pages on a site for accurate site-level categorization. Using DeGenTWeb, we find that LLM-dominant sites are highly prevalent both in data from Common Crawl and in Bing's search results, and that this share is growing over time. We also show that continuing to accurately identify such sites appears challenging given the capabilities of the latest LLMs.