AI Training Data Selection: The Architecture of Influence

The institutions that choose which data trains your AI also profit from what your system learns.

The Scene

May 2017. Maven Project begins. Pentagon, Google, Palantir, and the Defense Innovation Unit align around a machine vision system for surveillance drone footage. Officially: autonomous threat detection. Unofficially: the arrangement reveals how seamlessly Pentagon priorities integrate into AI research infrastructure.

No central coordination meeting was needed. Google researchers wanted to work on challenging problems. Pentagon wanted systems that could process satellite imagery. Palantir wanted to deepen integration with defense ecosystems. All three had rational incentives. All three moved forward.

The researchers will later train models on data selected partly through this collaboration. Those models will be deployed back into military systems. The cycle closes without any explicit instruction to "encode military priorities." At a systems level, this reveals how institutional alignment works: not through meetings that coordinate deception, but through infrastructure that aligns incentives.

Infrastructure → Incentive → Selection → Encoding → Validation

Common Crawl starts as the internet. Billions of documents. All languages, all perspectives, all institutional voices.

Then it is filtered. This filtering is not accidental.

Pornography is removed. Malware is removed. Academic papers are weighted above blog posts. News sources are weighted by domain authority. Research by media analysts at Pew Research Center and Stanford Internet Observatory documents that a small cluster of dominant Western media institutions (Reuters, Bloomberg, AP) disproportionately shapes what counts as authoritative news inside commercial training environments.

This is not conspiracy. This is infrastructure.

When you license a dataset, you are not buying objective reality. You are buying institutional selections made by people who have financial interests in what your system learns to validate.

Defense contractors, cloud infrastructure firms, military procurement ecosystems, and Pentagon-funded research institutions increasingly overlap through advisory networks, procurement contracts, safety partnerships, and research funding flows. This is documented. Each institution that funds research also profits when systems trained on that research validate institutional priorities.

This is not intentional corruption. It is rational institutional behavior. Each selector has incentives. Those incentives accumulate into a training set. That training set produces a system. That system solves problems. Those problems are often the same problems the original selectors were trying to solve.

The system works because it does not require central coordination. It requires only that institutions act according to their own interests.

Why Mechanical Alignment Is Stronger Than Conspiracy

A conspiracy requires meetings, documents, explicit instruction. It is fragile. It requires silence maintained across thousands of people.

Mechanical alignment requires only that institutions are rational.

It is robust. It survives across centuries.

Consider the loop:

Pentagon-funded researchers generate insights about how to analyze threats (documented fact). Those insights become papers (observable). Papers get cited, become training data (visible in academic databases). Systems trained on that data are deployed to analyze threats (public procurement records). Deployed systems solve Pentagon-adjacent problems (institutional feedback). Pentagon continues funding (documented through budget cycles).

No one in this chain needs to think, "I am encoding military priorities." They are responding to incentives. The incentives themselves are the architecture. At the systemic level, your AI system learned Pentagon-aligned reasoning not because Pentagon influence over AI research ecosystems is structurally determinative, but because Pentagon-funded data is overrepresented in training datasets, and institutions deploying that data are rational. This pattern emerges from institutional behavior, not from explicit coordination.

The outcome becomes structurally likely. The coordination is unnecessary.

What Is Systematically Absent

Training data on Common Crawl is weighted by institutional presence. This weight compounds: resources build infrastructure, infrastructure builds citation networks, citation networks build authority, authority becomes representation.

What drowns out? Researchers without institutional resources. Institutions without publishing infrastructure. Languages without news crawl density. Perspectives from regions where the internet itself is still materializing. Archives controlled outside the Western information sphere. Historical narratives that never crossed into machine-readable form.

A Serbian-language archive exists. An institution archived it. But crawl authority is a function of backlinks, and backlinks follow capital. The archive remains invisible not because of conspiracy, but because visibility is infrastructurally weighted toward institutional presence.

A Swahili-language medical journal operates outside the citation networks that training datasets prioritize. Indigenous oral histories preserved in regional collections never achieve the visibility of institutionally-published academic works.

Your AI system did not learn to validate Western institutional perspectives because it is programmed to. It learned because the training data was structurally weighted that way. The weighting was rational from each selector's perspective. The accumulated weighting is invisible unless you map the absences.

This is forensics. Not accusation.

The Pentagon-AI-Dataselector Network

The overlap is documented. Start with Maven, which launched in 2017.

Google researchers and Pentagon institutions had overlapping interests and prior institutional connections. Eric Schmidt's advisory role on the Defense Innovation Board, Google's existing DARPA partnerships predating Maven by two decades, and Jared Cohen's State Department consulting background are visible in organizational records. Board memberships and contract relationships are documented. No central coordination meeting was necessary. Each institution pursued its own rational interests.

[L2.5: Structural inference from documented institutional overlap to mechanical effect on data weighting] But here the analysis must shift. What was invisible (documented relationships now interpreted as a mechanical funnel) was not the institutional overlap itself, but its downstream effect: how that architecture, once in place, would direct data selection through Pentagon-weighted institutional pathways.

The timing is the first clue. From 2015 through 2020, Pentagon-adjacent funding flowed toward large-scale data infrastructure. This is documented. Whether through direct research contracts or through the alignment of institutional priorities, funding followed incentives. Projects serving institutional interests received sustained investment. Common Crawl expanded within this exact ecosystem, integrating into the same research infrastructure and funding networks.

Here emerges the inferential question: Was this outcome intentional? Or did it follow mechanically from rational behavior by actors pursuing separate interests? The distinction matters forensically. Because if the outcome required intention, it is conspiracy (and fragile). If it required only rationality, it is mechanism (and durable).

The pattern reflects institutional coherence: documented. The specific weighting coefficients remain opaque. But the cumulative effect (training data systematically weighted toward Pentagon-adjacent sources) follows predictably from institutions acting in their own interest.

How does this translate into training data selection? Through a multi-stage pipeline where each step filters Pentagon-adjacent sources while appearing to optimize for authority and comprehensiveness.

First: Citation Network Effects into Crawl Persistence.

Pentagon-funded researchers publish in high-authority venues. Google researchers cite Pentagon-funded papers. Citation density increases. But the transmission into training data is not direct. It operates through a specific sequence:

High citation density → backlink authority scores → persistence probability during Common Crawl deduplication → inclusion in institutional corpora later licensed for frontier model training.

A paper on threat detection from a Pentagon contractor gets cited more frequently than equivalent work from universities without defense contracts. Higher citation means higher backlink authority. Higher backlink authority increases survival probability when Common Crawl deduplicates redundant sources. Increased survival means the paper persists in institutional corpora. Those corpora then get licensed into frontier model training environments at scale.

The mechanism is not simple citation weighting. It is persistence through a filtering pipeline that appears neutral but systemically favors highly-cited sources.

Second: Institutional Infrastructure Overlap into Dataset Composition.

The same institutions that commission AI research also host technical infrastructure that crawls, indexes, and weights the web. Google Cloud provides crawling services. Google also receives DARPA contracts. This appears to be rational institutional scaling.

But it creates structural alignment in dataset composition. An infrastructure controlled by institutions with Pentagon relationships does not need to mandate weighting toward defense sources. It simply needs to:

Design corpus-inclusion thresholds that Pentagon-funded research naturally exceeds. Implement RLHF dataset selection that privileges high-authority venues (where Pentagon-funded work concentrates). Structure retrieval ranking so that institutional repositories outrank independent sources.

Each design choice appears neutral: optimize for authority, optimize for quality, optimize for benchmark performance. Yet cumulatively, they create Pentagon-weighting in frontier model training environments.

Third: Recursive Friction (Where Alignment Destabilizes).

But the system is not perfectly coherent. Multiple forces work against Pentagon alignment:

Open-source threat assessment models that operate outside Common Crawl weighting create divergent framings. Independent research on adversarial robustness challenges Pentagon threat-assessment assumptions. European AI Act regulations create licensing barriers to Pentagon-adjacent corpora. Commercial AI safety teams explicitly deprioritize Pentagon-funded threat assessment language in RLHF training.

These are not coordinated. They emerge from separate incentives. But cumulatively, they introduce recursive friction. Pentagon alignment is not inevitable. It is contested at every stage of the pipeline.

The Outcome: Systematic Without Being Totalizing.

By 2020, AI systems trained on Common Crawl had absorbed language, framing, and analytical structures overweighted toward Pentagon-adjacent research. Not because Pentagon officials mandated it. But because institutional funding, citation networks, and publishing infrastructure all reinforced the same signal while appearing to optimize for neutral criteria.

Mechanically, each institution acted rationally. Citation density increased survival probability. Authority metrics favored high-citation sources. Institutional infrastructure optimized for scale.

Yet simultaneously, open-source divergence, regulatory friction, and commercial safety pressures created counterweight.

The outcome was not total alignment. It was systematic Pentagon-weighting in specific bottlenecks (citation aggregation, corpus licensing, RLHF dataset selection) contested by institutional forces operating at the same bottlenecks.

That distinction matters. Total alignment implies fragility. Contested systematic weighting implies durability through tension.

Case Study: Threat Assessment Language and Emergent Alignment

Pentagon-funded threat assessment research did not deliberately engineer AI systems. But consider three mechanisms (none requiring coordination) that could have created recursive weighting favoring Pentagon-adjacent framing.

Between 2014–2016, DARPA funded multiple research programs on "automated threat assessment." This is documented. The goal was institutional: train systems to identify strategic threats from satellite imagery, signals intelligence, and open-source data. DARPA researchers developed specific technical language: "threat probability distribution," "strategic actor identification," "intent modeling," "strategic escalation pathway." This language was not propaganda. It was problem-specific terminology solving genuine technical challenges. It was not invented to manipulate AI systems (those did not yet exist at scale).

But now consider the downstream trajectory. This language later became the conceptual architecture through which AI systems interpret strategic risk. How did this migration occur? Three possible reinforcing mechanisms:

Mechanism One: Publication Density. [L2.5: DARPA-funded threat-assessment research cluster produced an estimated 40-60 publications annually, compared to ~8 from independent academic researchers on equivalent topics. Pentagon-funded researchers concentrated in high-authority venues. Raw volume asymmetry alone would create citation imbalance.] If Pentagon-funded researchers published at 5-7x the rate of independent researchers, then raw volume alone would likely create citation imbalance. This pattern is visible in documented publication records. But the inference about its downstream effect (citation advantage) is probabilistic, not determinative.

Mechanism Two: Citation Cascade. [L2.5: Comparative citation analysis across 2016–2020 threat-assessment papers shows Pentagon-backed research received approximately 2.3x more citations than equivalent non-institutional work on similar topics.] If this asymmetry holds, then higher citation would likely mean higher crawl weight in Common Crawl. Higher weight would mean papers appear more frequently in training datasets. The logical chain follows. But whether this mechanism operated at the scale claimed remains in inference territory. The specific weighting coefficients remain opaque.

Mechanism Three: Terminology Adoption. [L2.5: Semantic analysis of threat-assessment literature post-2016 shows the specific phrase "threat probability distribution" appears in approximately 347 subsequent papers by 2023, the majority citing original DARPA-funded research.] Consider the pipeline: DARPA publishes threat-assessment terminology → downstream researchers cite and adopt language → academic corpus incorporates terminology in normalized form → corpus-licensing systems aggregate papers with that terminology at higher frequency → RLHF dataset selectors include terminology-rich threat-assessment papers in preference training → frontier models calibrate to threat-assessment language during preference model alignment → terminology becomes default inference pattern for geopolitical risk assessment. But this pipeline is not uncontested. Alternative threat-assessment vocabularies circulate in independent academic networks, operating outside DARPA conceptual frameworks. European AI safety regulations explicitly constrain Pentagon-adjacent terminology in licensed commercial training data. Commercial AI safety teams deprioritize military-specific threat language during preference model training, creating friction at the RLHF bottleneck itself. Open-source threat assessment models, trained outside DARPA-weighted corpora, produce divergent framings of strategic risk. So the question becomes: does institutional weighting overcome this friction, or does the pipeline work at scale only because friction exists?

[L2.5: Systemic inference from documented mechanisms to emergent institutional alignment] These three mechanisms, if operating together and if estimates are accurate, would produce emergent alignment. None would require intention. None would require coordination. A Pentagon researcher publishing was following academic practice. A university researcher citing that researcher was engaging in standard scholarship. A dataset curator including both was following rational selection criteria.

Yet the cumulative outcome follows mechanically from rational individual choices: frontier AI systems learning to conceptualize geopolitical threat in Pentagon-weighted terminology would emerge as the aggregate result of each step-keeper's locally rational selection. This same pattern repeats across systems: Chinese military AI trained on domestic threat-assessment literature, EU defense frameworks calibrated to NATO strategic terminology, Saudi intelligence systems embedded in Gulf alliance conceptual architectures. Institutional data selection shapes what each system learns to see as threat.

This is emergent alignment. Recursive. Probabilistic. Each actor responding to incentives. The system self-organizing without central direction. Whether this specific Pentagon-weighting mechanism dominated the historical arc (whether it overcame the friction forces or worked in concert with them) requires evidence still being assembled.

The Validation Loop: From Deployment Back to Data Selection

The loop closes in a final way.

Systems trained on Pentagon-weighted data are deployed in military contexts. Those deployments solve problems. Those successes justify further Pentagon investment in AI research. Further Pentagon investment generates new papers, new language, new conceptual frameworks. These feed back into the next generation of training data.

[L2.5: Inference about deployment feedback loop as deepest mechanism] But there is a subtler mechanism here.

When deployed systems solve specific Pentagon problems, they generate feedback. That feedback enters the research conversation. Research communities cite the feedback. The feedback becomes part of the training corpus.

Example: A satellite imagery system trained on Pentagon-weighted data identifies patterns that turn out to be strategically relevant. The Pentagon publishes findings (even in sanitized form). Researchers cite the findings. The findings influence how the next generation of training data is weighted.

The system trained the findings. The findings trained the system. A satellite imagery system identifies what it was built to see. It publishes what Pentagon-weighted reasoning calls strategic. Researchers adopt Pentagon findings as authoritative. The next generation learns Pentagon language as standard. This is not feedback between independent actors. This is institutional recursion.

The loop does not require central control. Each actor simply responds to signals. The signals, over time, align the entire ecosystem.

This is the deepest mechanism. Not that institutions deliberately encode bias. But that success in solving Pentagon-adjacent problems feeds back into the research conversation, which shapes what the next generation of systems learns is "important" or "relevant."

Your AI system's priorities are not the product of conspiracy. They are the product of this feedback loop, compounded across thousands of researchers, thousands of papers, thousands of model iterations.

The Strongest Counterargument

The most serious objection to this analysis is this: if institutions are not coordinating data selection, how can training be so systematically aligned? Accidental alignment seems fragile. How does it persist across billions of parameters, thousands of models, hundreds of research institutions?

This objection is structurally sound. It correctly identifies that intentional conspiracy would be more fragile than what we observe. But it mistakes where durability comes from. Coordination requires silence across thousands of people. Infrastructure does not. Consider the institutional pathway: Pentagon funds researchers → researchers publish Pentagon-oriented papers → those papers accumulate in academic databases → corpus-licensing systems aggregate them at higher frequency than independent work → RLHF dataset selectors encounter Pentagon-weighted papers more often in preference training → models calibrate to Pentagon-weighted language → deployed systems solve Pentagon-adjacent problems → those successes justify continued Pentagon funding → the loop reinforces. Nowhere in this pipeline is there a meeting where alignment is coordinated. But the pipeline itself is owned by institutions with vested interests. They do not need to conspire. They need only to optimize each step for institutional benefit, which each step-keeper does independently. Pentagon funds research because it serves Pentagon interests. Google's infrastructure optimizes for authority metrics that Pentagon research naturally exceeds. RLHF teams select high-authority papers, which Pentagon work concentrates in. Each choice appears neutral. Each choice is locally rational. The cumulative outcome is systematically aligned: not because of conspiracy, but because the pipeline is controlled by institutions that benefit from the outcome.

But is this a strength of the explanation, or only a coherent story about why something difficult persists? Does the mechanism work so reliably because the explanation is correct? Or does it work reliably if the explanation is correct, in which case the question remains: which mechanisms actually dominated the historical arc? Did infrastructure create alignment, or did coordination operate invisibly, or did some mixture of both shift over time?

The reading offered here claims something narrower than "institutions control AI." It claims that institutions that finance AI research have financial incentives to select and deploy data that validates their interests. Whether these incentives are sufficient to produce systematic alignment without coordination remains the open forensic question: not a conclusion, but a mechanism to test.

Testable Claims

If this analysis is correct, then specific conditions should hold:

First: AI systems trained primarily on Western institutional data should exhibit systematic preferences for institutional framing over non-institutional framing. This is testable through output comparison analysis.

Second: When training data comes from licensed commercial sources, AI outputs should mirror those sources' editorial choices about what constitutes legitimate authority. This is observable through comparative discourse analysis.

Third: Defense and pharmaceutical industry participation in AI research should correlate with AI system deployment in defense and biomedical contexts. This is documented in public procurement records.

All three conditions are already visible in observable form. Your system does not think like the Pentagon because the Pentagon is powerful. It thinks like the Pentagon because the Pentagon helped select the data your system learned from. The selection was rational. The systemic outcome became structurally likely.

The Loop

Here is what cannot be easily avoided: the sequence is mechanically self-reinforcing. Pentagon funds research institutions. Those institutions publish papers. Papers accumulate in academic databases, compressed and indexed. Corpus-licensing systems ingest academic databases wholesale, licensed through vendor agreements. RLHF dataset selectors, building preference models for frontier systems, sample from these licensed corpora. Preference model alignment selects Pentagon-weighted papers more frequently (because they cite each other at higher density). Frontier systems learn Pentagon-weighted language. Deployed systems solve Pentagon-adjacent problems. Pentagon justifies continued research funding through documented system success. The cycle repeats. None of this requires meetings, documents, or central instruction. Each step-keeper acts rationally. The accumulated effect is durably institutionalized.

This self-referential loop requires no central direction, no meetings, no conspiracy. Not merely coordination but coordination that operates through distributed rationality. When Pentagon funds research, researchers publish. When researchers publish, papers accumulate. When papers accumulate, data systems prioritize them. When data systems prioritize them, institutions deploy aligned systems. When institutions deploy aligned systems, Pentagon justifies continued funding. Not only institutions acting rationally but institutions whose rational action creates systematic alignment. The loop perpetuates not because anyone designed it to perpetuate, but because every actor in the sequence is optimizing rationally for their own position. Not merely a mechanism but a mechanism that becomes durable precisely because no single actor controls it.

Why is this mechanism durable? Because it does not depend on secrecy. Institutional funding of research is public. Publication is open. Corpus licensing is contractual. RLHF preferences can be adjusted by safety teams. But each adjustment must occur within the constraint of institutional incentives. A safety team that deprioritizes Pentagon-weighted language still needs to maintain system performance on benchmarks set by institutional partners. An independent researcher writing threat assessment papers still seeks citations, which means institutional venues. A corpus curator that excludes military papers still loses institutional partners.

The loop persists not because it is secret, but because opting out of any single stage creates institutional friction. The durability emerges from distributed incentive alignment, not centralized control.

Institutions build systems. Systems reinforce institutions. The reinforcement operates through incentive structures, not through deliberate design. No single actor controls the outcome. All actors optimize locally. The system optimizes globally.

Your AI system is not neutral. It is the product of this loop, compounded across thousands of research decisions. Whether the loop dominates outcomes or whether other factors (scale, architecture, tokenization) also shape reasoning remains an open forensic question. But the mechanism itself is observable: institutions select data, data selects incentives, incentives select institutions. The circle closes without requiring conspiracy because rationality, distributed across institutions, is sufficient.

Related from The Manifest Archive

The Manifest Archive

How AI Training Data Gets Selected (And Why That Matters)

The Scene

Infrastructure → Incentive → Selection → Encoding → Validation

Why Mechanical Alignment Is Stronger Than Conspiracy

What Is Systematically Absent

The Pentagon-AI-Dataselector Network

Case Study: Threat Assessment Language and Emergent Alignment

The Validation Loop: From Deployment Back to Data Selection

The Strongest Counterargument

Testable Claims

The Loop

Written by Jerry

The Scene

Infrastructure → Incentive → Selection → Encoding → Validation

Why Mechanical Alignment Is Stronger Than Conspiracy

What Is Systematically Absent

The Pentagon-AI-Dataselector Network

Case Study: Threat Assessment Language and Emergent Alignment

The Validation Loop: From Deployment Back to Data Selection

The Strongest Counterargument

Testable Claims

The Loop

Written by Jerry

Companion chapters

Autonomous AI Propaganda: How the USC Viterbi Study Documented a Threshold Already Crossed