Content Signals

Content Signals is a proposed extension to robots.txt that lets website owners declare specific preferences for how AI systems may use their content — separately from whether crawlers can access it at all.

The Problem It Solves

Standard robots.txt answers only one question: may a crawler access this page or not? It has no way to express more nuanced preferences, such as:

"You may crawl this page and use it to answer user questions in real time, but you may not use it as training data for your model."

Content Signals introduces a structured way to declare exactly that distinction.

The Three Signals

Content-Signal: ai-train=no, search=yes, ai-input=yes
Signal      Controls
ai-train    Whether AI companies may use this content to train models
search      Whether this content may appear in search engine results
ai-input    Whether this content may be retrieved and used as live context when an AI system answers a user's question

Each signal accepts yes or no. A signal left out of the directive expresses no preference either way.
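
To make the format concrete, here is a minimal Python sketch that parses one directive line into a dictionary. The function name parse_content_signal and its strictness (it rejects unknown signal names and any value other than yes or no) are illustrative choices, not something the proposal prescribes.

```python
# Hypothetical helper: names and error handling are assumptions for illustration.
VALID_SIGNALS = {"ai-train", "search", "ai-input"}
VALID_VALUES = {"yes", "no"}


def parse_content_signal(line: str) -> dict[str, bool]:
    """Parse one 'Content-Signal: ...' line into {signal: allowed}."""
    name, sep, value = line.partition(":")
    if not sep or name.strip().lower() != "content-signal":
        raise ValueError(f"not a Content-Signal line: {line!r}")

    signals: dict[str, bool] = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        key, eq, setting = item.partition("=")
        key, setting = key.strip().lower(), setting.strip().lower()
        if not eq or key not in VALID_SIGNALS or setting not in VALID_VALUES:
            raise ValueError(f"unrecognized signal: {item!r}")
        signals[key] = setting == "yes"
    return signals


print(parse_content_signal("Content-Signal: ai-train=no, search=yes, ai-input=yes"))
# {'ai-train': False, 'search': True, 'ai-input': True}
```

Signals that are absent from the line are simply absent from the result, which keeps "not stated" distinct from "no".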

How to Implement It

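Add a Content-Signal directive to each relevant User-agent group in your robots.txt. A crawler follows the most specific group that matches it and ignores the * group when a named one exists, so repeat the directive in every group: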

User-agent: *
Content-Signal: ai-train=no, search=yes, ai-input=yes

User-agent: GPTBot
Content-Signal: ai-train=no, search=yes, ai-input=yes

User-agent: ClaudeBot
Content-Signal: ai-train=no, search=yes, ai-input=yes

See robots.txt for the full directive structure, and AI Crawlers for a complete list of user-agents worth covering.
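
If you maintain robots.txt from a script, the groups above could be generated rather than typed by hand. The following is a rough Python sketch; content_signal_line and robots_groups are hypothetical helper names, and the output covers only the Content-Signal groups, so any Allow or Disallow rules you already have would still need to be merged in.

```python
# Hypothetical helpers for illustration; not part of any published tool.
def content_signal_line(ai_train: bool, search: bool, ai_input: bool) -> str:
    """Format the three signals as a single Content-Signal directive."""
    def yn(flag: bool) -> str:
        return "yes" if flag else "no"

    return (
        f"Content-Signal: ai-train={yn(ai_train)}, "
        f"search={yn(search)}, ai-input={yn(ai_input)}"
    )


def robots_groups(user_agents: list[str], directive: str) -> str:
    """Emit one User-agent group per crawler, each carrying the directive."""
    groups = [f"User-agent: {ua}\n{directive}" for ua in user_agents]
    return "\n\n".join(groups) + "\n"


directive = content_signal_line(ai_train=False, search=True, ai_input=True)
print(robots_groups(["*", "GPTBot", "ClaudeBot"], directive))
```

Passing the user-agent list as an argument makes it easy to extend coverage as new crawlers appear.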

Choosing Your Settings

Most businesses publishing public-facing marketing content — services, about pages, contact information — benefit from being discoverable and citable, while having less interest in their specific copy being absorbed into model training data.

A common, reasonable default:

Content-Signal: ai-train=no, search=yes, ai-input=yes

This says: don't train on my content, but do show it in search results, and do use it to answer questions about my business in real time.

If you operate a content business where the text itself is the product (journalism, paid research, proprietary analysis), you may want stricter settings:

Content-Signal: ai-train=no, search=yes, ai-input=no

Validating Your Settings

You can check whether your Content Signals are correctly configured using a free scan at isitagentready.com:

POST https://isitagentready.com/api/scan
Content-Type: application/json

{"url": "https://yourdomain.com"}

In the JSON response, check that the checks.botAccessControl.contentSignals.status field is "pass".
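
The same check could be scripted with Python's standard library. This is a sketch under assumptions: the endpoint and the status path come from the description above, but the rest of the response shape, any authentication, and rate limits are unknown here, and the function names are made up for illustration.

```python
import json
import urllib.request
from typing import Optional

SCAN_ENDPOINT = "https://isitagentready.com/api/scan"


def build_scan_request(site_url: str) -> urllib.request.Request:
    """Build the POST request described above (nothing is sent yet)."""
    body = json.dumps({"url": site_url}).encode("utf-8")
    return urllib.request.Request(
        SCAN_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def content_signals_status(report: dict) -> Optional[str]:
    """Walk checks.botAccessControl.contentSignals.status; None if absent."""
    node = report
    for key in ("checks", "botAccessControl", "contentSignals", "status"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node if isinstance(node, str) else None


def scan(site_url: str, timeout: float = 30.0) -> bool:
    """Send the scan request and report whether Content Signals passed.

    Assumes the service answers with a JSON object; that is not verified here.
    """
    request = build_scan_request(site_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        report = json.load(resp)
    return content_signals_status(report) == "pass"
```

Calling scan("https://yourdomain.com") performs the network request. Walking the nested keys defensively means an unexpected response yields False instead of a KeyError.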

See Validation for more on confirming your full AI-readiness setup.

Current Standards Status

Content Signals originates from an IETF draft (draft-romm-aipref-contentsignals) submitted by engineers at Cloudflare. As of this writing, the draft has expired and has not been adopted by a formal IETF working group. Like robots.txt itself, the signals are declared preferences rather than technical enforcement, and no AI crawler is required to honor them.

This doesn't make it worthless to implement. Cloudflare's own AI Crawl Control product already supports Content Signals configuration, and early adoption costs only a few lines in your robots.txt. As with llms.txt, being early to an emerging convention is low-cost and positions your site ahead of eventual formalization.

For more on where this fits among other emerging standards, see Standards Glossary.

How AIA Matrix Implements This

AIA Matrix automatically includes Content Signals directives in every generated robots.txt, scoped to each major AI crawler individually. Default settings follow the ai-train=no, search=yes, ai-input=yes pattern, and Professional plan users can request custom settings per crawler through their dashboard.