The Attribution Problem: When AI Uses Your Content Without Credit
The attribution problem is not new. Writers have always contended with their ideas being absorbed, repackaged, and circulated without credit. What is new is the scale and the mechanism. Large language models trained on web-scraped data incorporate your content into their operational knowledge — your frameworks, your phrasing, your carefully researched conclusions — and reproduce them in response to millions of queries without any indication that your work exists. This is not plagiarism in the traditional sense; the model is not copying your article. It is something arguably more unsettling: it has digested your article so thoroughly that your ideas emerge as the model’s own default knowledge. Cory Doctorow has written extensively about the ways digital systems extract value from creators while returning little, and the LLM training pipeline is the most efficient extraction mechanism yet devised.
The Core Issue
When an LLM is trained, it processes billions of pages of text scraped from the open web. Your blog posts, your guides, your original research — if it was publicly accessible, it was likely included in one or more training datasets. The model does not store your articles as retrievable files. It absorbs patterns, facts, and relationships from those articles into its neural network weights. When a user later asks a question that your content would have answered, the model generates a response informed by what it learned from your work — and thousands of other works — without citing any of them.
This is fundamentally different from a search engine, which at least points the user to your page. It is different from a human reader who absorbs your ideas and fails to cite you, because the human operates at human scale and the model operates at the scale of millions of interactions per day. Your original research on, say, self-custody best practices may inform the model’s answer to every self-custody question it receives, reaching an audience orders of magnitude larger than your website ever could — and you will never know it happened, never receive a visitor, and never see your name.
The extraction is real. The question is what, if anything, can be done about it — and what the proportional response looks like for a sovereign builder who depends on being found.
The Legal Landscape
The courts are still working through the fundamental questions. As of early 2026, multiple high-profile lawsuits are proceeding against major AI companies, and no definitive legal resolution has emerged. The New York Times v. OpenAI case is the most prominent — the Times argues that OpenAI’s training on its copyrighted articles constitutes infringement, while OpenAI argues that training on publicly available text is transformative fair use. The outcome of this case will likely set precedent for how copyright law applies to machine learning, but that outcome is not yet known.
Other cases involve authors, visual artists, and code repositories. The legal theories vary — some focus on the training process itself as infringement, others on the outputs that closely reproduce copyrighted material, still others on the commercial benefit derived from training on others’ work without compensation. The honest assessment is that the legal framework has not caught up with the technology. Copyright law was designed for a world where copying meant reproduction, and the relationship between ingesting training data and producing novel outputs does not map cleanly onto existing doctrine.
For the sovereign builder, the legal uncertainty means that you cannot rely on courts to solve the attribution problem in a timeframe that matters for your content strategy. The law may eventually establish clear rules about AI training and attribution. But “eventually” could mean years, and your content is being used now.
The Robots.txt Trade-Off
You have a technical option: block AI training crawlers from accessing your content. Most major AI companies identify their crawlers with published user-agent strings (GPTBot for OpenAI, ClaudeBot for Anthropic, and others), and you can instruct those crawlers to skip your site with directives in your robots.txt file. This is a real tool, and to the extent crawlers honor it, it keeps your content out of future training runs.
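To make this concrete, here is a minimal robots.txt sketch that blocks the best-known training crawlers. Vendors rename and add bots over time, so the user-agent strings below are assumptions current as of this writing; verify them against each company's published documentation before relying on them.

```
# Block known AI training crawlers.
# User-agent names are assumptions; confirm against vendor docs.

User-agent: GPTBot            # OpenAI's training crawler
Disallow: /

User-agent: ClaudeBot         # Anthropic's crawler
Disallow: /

User-agent: CCBot             # Common Crawl, a major source of training corpora
Disallow: /

User-agent: Google-Extended   # opts out of Google's AI training, not Google Search
Disallow: /

User-agent: *                 # ordinary search crawlers remain welcome
Allow: /
```

Keep in mind that robots.txt is a request, not an enforcement mechanism: it keeps out the crawlers that choose to honor it, and nothing more.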
The trade-off is significant. Blocking AI crawlers may also reduce your visibility in the retrieval-augmented generation systems that power real-time LLM citation. If Perplexity’s crawler cannot access your content, Perplexity cannot cite you. If you block the crawlers that feed search-augmented LLM systems, you may protect your content from training use while simultaneously making yourself invisible to the AI-powered search interfaces that are becoming a primary information channel. The protection and the visibility use the same door, and locking it keeps out both the extractors and the attributors.
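That said, the door has more than one lock. Some vendors operate separate user agents for model training and for live retrieval, which makes a middle path possible: refuse the training crawlers while admitting the bots that fetch pages to answer queries with citations. A sketch under the same assumption that these names are current (OAI-SearchBot is OpenAI's search crawler, PerplexityBot feeds Perplexity's index):

```
# Middle path: refuse training, accept cited retrieval.
# These user-agent names are assumptions; check current vendor docs.

User-agent: GPTBot            # OpenAI model training: blocked
Disallow: /

User-agent: CCBot             # Common Crawl training corpus: blocked
Disallow: /

User-agent: OAI-SearchBot     # OpenAI's search/citation crawler: allowed
Allow: /

User-agent: PerplexityBot     # Perplexity's index crawler: allowed
Allow: /

User-agent: *
Allow: /
```

How cleanly each vendor actually separates its training and retrieval crawlers is an open question, which only sharpens the trade-off.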
For most sovereign builders — those whose primary goal is visibility and audience building rather than content licensing revenue — the calculus tips toward keeping the door open. Being cited by LLMs, even imperfectly, is more valuable than being invisible to them. The content you publish on your own platform is already public; the question is whether it generates traffic and recognition, or whether it generates nothing while still being available to anyone who visits your site directly.
The Practical Stance: Visibility Over Protection
We are not suggesting that the extraction is acceptable. We are suggesting that for most independent publishers, the proportional response is to optimize for visibility rather than to optimize for protection. This is not unlike the enforcement gap we discuss elsewhere on this site: the surveillance apparatus is real, but hiding from it costs more than the surveillance itself costs most individuals. Similarly, the training data extraction is real, but hiding from it costs more — in lost visibility — than the extraction itself costs most publishers.
The sovereign builder who publishes with clear attribution, strong branding, and structured content creates a recognizable identity that persists even when specific articles are not cited. If your frameworks have distinctive names, if your analysis is associated with your brand, if your author identity is consistent across the web — then even when an LLM reproduces your ideas without citation, the human ecosystem remembers where those ideas originated. Readers who encounter your framework in an LLM response and then search for more context will find you, if your SEO is sound and your authority is established.
This is not a perfect solution. It is a proportional one. It acknowledges the reality of the extraction while focusing energy on the response that actually builds something — rather than the response that merely prevents something.
What You Can Do
Several concrete practices improve your position in this landscape. First, publish with clear and consistent branding. When your content has a recognizable voice, distinctive frameworks, and a visible author identity, the human network of citation and recognition works in your favor even when the machine network does not. Second, use canonical URLs and structured data so that retrieval systems that do cite sources can correctly identify and link to your original content (a minimal markup sketch appears after the fifth practice below). Third, maintain a strong presence in search: retrieval-augmented systems draw their candidates from search indexes, so strong rankings are your best path to attributed LLM citations.
Fourth, consider publishing original research and data. LLMs can reproduce your prose, but they are less effective at replacing you as the source of original data, surveys, or analysis. Content that is the primary source — not a secondary summary of someone else’s research — is harder to extract without attribution, because the data itself points back to you. Fifth, build an email list. LLM citation is a distribution channel you do not control; your email list is a distribution channel you own completely. Every subscriber is a reader who does not need an LLM to find you.
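For the second practice above, canonical URLs and structured data, the standard mechanisms are a rel="canonical" link and a schema.org Article block in JSON-LD. A minimal sketch, with hypothetical URLs, names, and dates standing in for your own:

```html
<!-- In the <head> of the article page -->
<link rel="canonical" href="https://example.com/guides/self-custody" />

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Self-Custody Best Practices",
  "url": "https://example.com/guides/self-custody",
  "datePublished": "2026-01-15",
  "author": {
    "@type": "Person",
    "name": "Jane Author",
    "url": "https://example.com/about"
  },
  "publisher": {
    "@type": "Organization",
    "name": "SovereignCML"
  }
}
</script>
```

The canonical link tells retrieval systems which URL is the original when your content is syndicated or scraped; the JSON-LD ties the article to a persistent author and publisher identity that citation-capable systems can surface.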
The Bigger Picture
The attribution problem is a digital sovereignty issue — perhaps the defining one of the current era. Powerful systems are using individual creators’ work to build commercial products, and the creators receive neither compensation nor credit. This is not unprecedented; it is the pattern Zuboff documented in The Age of Surveillance Capitalism — behavioral surplus extraction, applied to intellectual labor instead of browsing behavior. The raw material is your published knowledge. The product is an AI system that answers questions using that knowledge. The surplus — the value extracted beyond what is returned to you — accrues to the platform operators.
The long-term solution is not individual opt-out. It is the development of standards, norms, and eventually legal frameworks that require attribution and compensation when AI systems use identifiable content. Some progress is being made: industry groups are proposing attribution standards, some AI companies are negotiating licensing deals with publishers, and the legal cases working through the courts will eventually produce precedent. But the sovereign builder does not wait for institutions to solve problems. You build on your own land, establish your authority through your own effort, and ensure that when the norms do emerge, you are positioned to benefit from them — rather than having been invisible while they were being established.
The attribution problem is the latest version of an old sovereignty challenge. The proportional response is the same one it has always been: build something durable on ground you own, and make it good enough that people seek you out regardless of which door they come through.
This article is part of the LLM Visibility & GEO series at SovereignCML.
Related reading: "How LLMs Choose What to Cite"; "Perplexity, Google AI Overviews, and ChatGPT: How Each Platform Handles Sources"; "The GEO + SEO Unified Strategy"