Cloudflare has turned its attention from erecting bot barriers to dangling bot bait.
Having previously devised a mechanism to make AI crawlers pay to consume website content, the content delivery network is now offering web publishers a way to make it cheaper for AI services to harvest site content: converting HTML to Markdown, the minimalist markup language that mixes formatting characters into plain text while keeping it readable.
In a blog post, Cloudflare engineering director Celso Martinho and VP Will Allen explain that AI crawlers and software agents, which constitute a growing portion of web traffic, find it easier to digest documents formatted in Markdown than traditional HTML web pages.
The reason is that HTML web pages often contain a lot of characters describing formatting, along with identifiers unrelated to the semantic content, and chewing through all those tags and related markup has a computational cost.
“Feeding raw HTML to an AI is like paying by the word to read packaging instead of the letter inside,” explain Martinho and Allen. “A simple ## About Us on a page in markdown costs roughly 3 tokens; its HTML equivalent – <h2 id="about-us">About Us</h2> – burns 12-15, and that’s before you account for the wrappers, nav bars, and script tags that pad every real web page and have zero semantic value.”
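The point is easy to check for yourself: the short sketch below (not from the Cloudflare post) counts tokens for both forms using the tiktoken library; the exact numbers depend on which tokenizer a given model uses.

# Sketch: compare token counts for the same heading in Markdown and HTML.
# Assumes the third-party tiktoken library; counts vary by tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
markdown_heading = "## About Us"
html_heading = '<h2 id="about-us">About Us</h2>'

print(len(enc.encode(markdown_heading)))  # a handful of tokens
print(len(enc.encode(html_heading)))      # several times as many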
To make web content easier for AI crawlers to chew, Cloudflare’s network can now respond to crawler requests with Markdown rather than HTML. To trigger the conversion, an AI crawler lists text/markdown as one of the acceptable content types in the Accept header it sends during content negotiation.
If a site publisher enables Markdown, Cloudflare’s network will answer with the page converted to Markdown, plus an x-markdown-tokens response header that carries the token count. That’s potentially useful for calculating whether the incoming content will fit within the model’s context window or whether it needs to be broken up into a series of smaller chunks.
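From the client side, that exchange might look something like the sketch below; the URL is illustrative, and x-markdown-tokens is the header name the post describes.

# Sketch: request a page as Markdown via HTTP content negotiation.
# The URL is illustrative; x-markdown-tokens is the header described in the post.
import requests

resp = requests.get(
    "https://example.com/blog/post",
    headers={"Accept": "text/markdown"},
)

# If the publisher has enabled Markdown conversion, the body arrives as Markdown
# and the token count arrives in a response header.
print(resp.headers.get("content-type"))
print(resp.headers.get("x-markdown-tokens"))
print(resp.text[:200])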
For a web page like the Cloudflare blog post, Markdown delivery reduces the number of tokens used from 16,180 in HTML to 3,150 in Markdown, a savings of 80 percent.
The Markdown option, which is available for HTML but not other document formats like PDF, complements another recently deployed capability, the company’s Content Signals Policy.
Content Signals Policy is a framework for adding machine-readable instructions to a website’s robots.txt file, the Robots Exclusion Protocol mechanism that lets publishers communicate how they expect bots and crawlers to engage with their site. The policy exists to spell out content usage preferences more precisely.
A site’s Content Signals Policy is expressed in a robots.txt directive that declares three key-value pairs. For example:
User-Agent: *
Content-Signal: ai-train=no, search=yes, ai-input=no
Allow: /
The parameters specify whether content can be used for AI training, for AI search, and for AI input (post-training uses like retrieval augmented generation or model grounding).
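A crawler that chooses to honour those signals has to parse the directive somehow; the sketch below is an illustrative parser, not code from any published library or spec.

# Sketch: turn a Content-Signal line from robots.txt into a dict of booleans.
# Illustrative only; a real crawler would parse the full robots.txt file.
def parse_content_signal(line: str) -> dict:
    _, _, value = line.partition(":")
    signals = {}
    for pair in value.split(","):
        key, _, setting = pair.strip().partition("=")
        signals[key] = setting.strip().lower() == "yes"
    return signals

print(parse_content_signal("Content-Signal: ai-train=no, search=yes, ai-input=no"))
# {'ai-train': False, 'search': True, 'ai-input': False}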
As part of robots.txt, Content Signals Policy directives are voluntary; they do not represent technical protection measures.
According to Martinho and Allen, coding agents like Claude Code and OpenCode already ask for Markdown in their Accept headers. Web publishers can now cater to automated clients if they choose. ®