The Classified AI Doctrine
AI adoption is risky. So is delay. Security policy has to measure both.
The national-security enterprise is hearing two true stories about AI adoption at the same time. The goal here is to measure the cost of delayed adoption against the costs and benefits of adoption, on one national-security ledger.
How to read it: each bar is a level of AI adoption. Blue (above the line) is the annual national-security gain from productivity, computed with the Torres defense production-function model. Red (below) is deliberately conservative: it charges every bar the full expected annual cost of all ten feared security-event categories, as if AI adoption owned the entire security ledger. The white tick is the net. The no-adoption bar escapes none of that cost, because the same threat categories (insider risk, spillage, foreign cyber activity, supply-chain exposure, and today’s unmanaged workarounds) persist under the status quo while part of the workforce reverts to unmanaged AI use; it is all cost and no gain. By medium adoption the gains win, and the net reaches +8.9% of annual national-security output at full adoption. The gold zone marks where the U.S. government started in early 2026, with chatbots on older-generation AI models: the only zone where the net is negative. Yes, everyone is moving as fast as they can. This article is about turning the Security organization into the accelerator.
Security professionals see real adversaries, real classified information, real CUI, real insider risk, real supply-chain risk, and real accreditation obligations. They are right to reject any AI adoption plan that treats those risks as paperwork.
The broader workforce sees something else that is also real: modern AI can already compress writing, coding, research, discovery, review, triage, summarization, training, and administrative work. On classified and CUI systems, those are not office conveniences. They are mission throughput.
The mistake is turning those truths into a binary fight between AI and security. That is not the decision. The decision is whether the government should:
delay while conducting narrow pilots;
tolerate unmanaged workarounds as users seek capability outside the approved path; or
field managed AI inside accredited CUI and classified environments with measurable controls, logging, provenance, no external egress, model supply-chain controls, and human verification for high-impact work.
That third path is the policy this model supports. Not because AI is safe. Because managed adoption appears safer than either unmanaged adoption or delay once productivity, baseline threats, AI-incremental risk, and defensive control gains are scored on the same national-security ledger.
Security should not have to prove AI is risk-free before fielding it. The burden of proof for delay should be that a proposed delay reduces more national-security harm than it causes.
Bottom Line Up Almost-Front
This is the shortest version of the argument.
Managed AI adoption on classified and CUI systems appears to increase national security more than the feared AI-incremental security risks decrease it. A categorical delayed-adoption default does not remove most of the threat landscape. It mainly gives up the productivity term while baseline insider risk, spillage risk, foreign cyber activity, supply-chain exposure, flawed analysis, aggregation risk, and user workarounds remain.
The load-bearing numbers are straightforward. By 2035, Low AI adoption adds about 1.3% of annual national-security output, Medium adds about 5.1%, and High adds about 10.5%. The equivalent cleared-labor figures are shown below because they are more relatable to the national-security enterprise than percentages.
Those “effective worker” numbers are not a force-reduction argument. They are a mission-capacity argument. They mean fewer analytic backlogs, faster software delivery, faster acquisition review, faster issue triage, faster security processing, better document review, more red-team coverage, and more time for expert judgment. Converting the gain into mission output also buys time to retrain people toward emerging missions and to respond more rapidly to societal needs.
The central managed-enclave policy is about +3.5% of annual national-security output after implementation cost drag and counterfactual AI risk. Unmanaged adoption is about -2.1%, which is why the recommendation is not “let everyone use whatever AI they want.” The recommendation is the opposite: give the workforce a sanctioned path that is strong enough to pull real work into the managed control surface.
Now the vocabulary those results assume.
Low adoption means standalone chatbot use with the latest model: ask, copy, paste, and clean up. It is useful, but outside the main workflow. Medium adoption means an assistant (Codex, Claude, Copilot, and peers) embedded in the workflow plus retrieval-augmented generation (RAG), grounded on approved government corpora. That is the recommended near-term target. High adoption means human-supervised agentic workflow: multi-step, tool-using orchestration over mission tools, gated by least privilege, logging, sandboxing, and human approval for high-impact actions.
Managed adoption means current-generation AI inside an accredited enclave with approved corpora, no external egress, access control, provenance, logging, monitoring, red teaming, and model supply-chain review. The “current-generation” requirement is critical, for a reason the model section spells out: the national-security enterprise should not be operating on knowledge months older than the rest of the world. Unmanaged adoption means AI use outside the sanctioned control surface: uncontrolled contractor-side use, transcribing material to and from the low side, personal accounts and devices, or other routes with weak telemetry.
AI-incremental risk is the counterfactual difference between risk with managed AI and risk under the status quo. It does not mean the risk is linear or small. It means the model charges AI only for the part AI causes or worsens. When a chart says “full expected event toll,” it is using a harsh accounting test that charges every adoption scenario the same gross expected annual toll from the ten modeled security categories. It is not the formal counterfactual risk model, and it is not a claim that every imaginable incident happens every year.
The first chart is the executive sanity check.
Read this as a burden-of-proof chart. It charges every adoption scenario the same expected annual toll, about 3.4% of output, from the ten security-fear categories, as if AI owned that entire ledger. That is deliberately harsh to the AI side. Even then, full adoption lands around +7.1% after the toll. Medium adoption lands around +1.8%. No adoption does not become safe. It sits at zero before the same security-event toll and around -3.4% after it. One accounting note so the two charts reconcile: this toll is larger than the roughly 1.6% per year charged in the opening chart because the two come from different risk priors. This one comes from a standalone elicitation of the ten fears; the opening chart uses the counterfactual world model. Both charge the full toll against AI, and the scenario ordering is the same under either.
That chart is not the formal risk model. It is the “stop making the accounting error” model. The accounting error is treating all security risk as AI-created risk. Insiders exist without AI. Spillage exists without AI. Foreign cyber operations exist without AI. Supply-chain compromise exists without AI. Flawed analytic assessments exist without AI. Aggregation risk exists without AI. A categorical non-adoption policy can avert only the AI-caused or AI-worsened part of those risks. It cannot make the status quo clean.
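The gross-toll arithmetic is simple enough to check by hand. A minimal Python sketch, assuming the rounded 2035 gains quoted earlier (0, 1.3, 5.1, and 10.5 percent of annual output) and the two tolls named above; the dictionary and function names are illustrative only:

```python
# Hedged sketch: gross-toll accounting with the rounded 2035 gains quoted in the text.
# All values are percent of annual national-security output.
GAINS_2035 = {"None": 0.0, "Low": 1.3, "Medium": 5.1, "High": 10.5}

def net_after_toll(gains: dict[str, float], toll: float) -> dict[str, float]:
    """Charge every scenario the same gross toll, as if AI owned the whole security ledger."""
    return {name: round(gain - toll, 1) for name, gain in gains.items()}

standalone = net_after_toll(GAINS_2035, toll=3.4)      # standalone ten-fear elicitation
counterfactual = net_after_toll(GAINS_2035, toll=1.6)  # counterfactual world-model prior

for name in GAINS_2035:
    print(f"{name:>6}: {standalone[name]:+.1f}% after 3.4% toll, "
          f"{counterfactual[name]:+.1f}% after 1.6% toll")
```

With these rounded inputs, Medium nets +1.7% rather than the chart's +1.8%, presumably because the model works from unrounded values; the quoted No-adoption and High values, including the opening chart's +8.9%, match.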
The second chart runs the decision over time. It uses 40,000 simulated futures from 2026 to 2035.
The conservative move in this chart is easy to miss: AI capability is held fixed, and only adoption changes. That is almost certainly false in the real world. AI capability is improving quickly. The chart freezes it anyway because the point is not to argue with assumptions. The point is to measure and argue with data. More adoption means more national-security output. Faster adoption means the advantage compounds sooner. “Delay while we pilot” has a measured cost.
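To see why delay has a cost even with capability frozen, here is a deterministic toy in Python, not the project's 40,000-future simulation. It assumes a hypothetical four-year linear adoption ramp and treats the annual gain as proportional to realized adoption, which is only an approximation of the CES model:

```python
# Hedged toy model of the cost of delay with AI capability held fixed.
# Assumptions (hypothetical, for illustration only):
#   * the full-adoption annual gain is a constant because capability is frozen;
#   * realized adoption ramps linearly to 100% over RAMP_YEARS once rollout starts;
#   * the annual gain scales linearly with realized adoption.
YEARS = range(2026, 2036)
FULL_GAIN = 5.1   # percent of annual output: the Medium 2035 figure from the text
RAMP_YEARS = 4    # hypothetical ramp length

def adoption_share(year: int, start: int) -> float:
    """Fraction of target adoption realized in `year` if rollout begins in `start`."""
    if year < start:
        return 0.0
    return min(1.0, (year - start + 1) / RAMP_YEARS)

def cumulative_gain(start: int) -> float:
    """Sum of annual gains over 2026-2035, in percent-of-output-years."""
    return sum(FULL_GAIN * adoption_share(y, start) for y in YEARS)

now = cumulative_gain(start=2026)
delayed = cumulative_gain(start=2028)  # e.g., a two-year "pilot first" delay
print(f"start 2026: {now:.2f}  start 2028: {delayed:.2f}  forgone: {now - delayed:.2f}")
```

Under these assumptions a two-year delay forgoes two full years of the Medium gain (about 10 percent-of-output-years over the decade), and the gap never closes, because the delayed path only catches up to the same plateau.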
One convention before the rest of the argument: every claim here is time-boxed to the public model as of June 2026. This is not a classification decision. It is an unclassified, public-source policy model. Where the model projects, it labels the projection. Where the evidence is transferred from commercial work into classified work, it says so.
How To Grade An AI Security Policy
The usual question is too vague: “Is AI risky?” Of course AI is risky.
The better risk question is: which policy produces less total national-security damage?
A risk policy works if a causal chain runs from the policy to measured national-security gain or measured risk reduction. A risk policy fails if the risk appears anyway through another path: baseline insider activity, existing supply-chain compromise, public-SaaS workarounds, low-side cut-and-paste, contractor-side tools, degraded official tools, or adversary adoption. A policy fails if it gives up more national-security output than the risk it can actually avert.
Graded that way, the scorecard splits in a way the current debate often misses.
Security professionals often agree on many of the controls. The bottleneck is usually not the concept of controls; it is the evidence burden and the review process around them. Bodies of evidence take weeks or months to assemble, move, interpret, and rework, and there is often no clear threshold at which the answer becomes “yes.” That failure mode turns accreditation into a queue instead of a risk decision. Queuing up more advanced AI models and tools causes delay, and delay also has a cost.
The answer is not to bypass Security. The answer is to make accreditation evidence-producing: predefined test suites, accepted control metrics, continuous telemetry, red-team reports, signed model provenance, DLP results, marking-propagation accuracy, and human-review gates. If the evidence passes, the workflow should move. If it fails, the workflow should pause at that boundary and the control owner should know exactly what to fix.
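One way to picture evidence-producing accreditation is a gate that compares submitted evidence against thresholds agreed in advance and names the owner of every failing control. The Python sketch below is hypothetical throughout: the control names, owners, and thresholds are placeholders, not any agency's actual control package.

```python
# Hedged sketch of an evidence-producing accreditation gate.
# Every control name, owner, and threshold below is hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class ControlCheck:
    name: str                 # e.g., "marking-propagation accuracy"
    owner: str                # accountable control owner
    threshold: float          # accepted metric threshold, agreed before testing
    higher_is_better: bool = True

def evaluate(checks: list[ControlCheck], measured: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (go, findings). Missing evidence counts as a failure, not a pass."""
    findings = []
    for check in checks:
        value = measured.get(check.name)
        if value is None:
            findings.append(f"{check.name}: no evidence submitted (owner: {check.owner})")
            continue
        ok = value >= check.threshold if check.higher_is_better else value <= check.threshold
        if not ok:
            findings.append(f"{check.name}: measured {value} vs threshold "
                            f"{check.threshold} (owner: {check.owner})")
    return (not findings, findings)

checks = [
    ControlCheck("marking-propagation accuracy", "classification office", 0.99),
    ControlCheck("DLP block rate on seeded spills", "data protection", 0.98),
    ControlCheck("open critical red-team findings", "red team", 0.0, higher_is_better=False),
    ControlCheck("external egress attempts allowed", "network security", 0.0, higher_is_better=False),
]
measured = {
    "marking-propagation accuracy": 0.995,
    "DLP block rate on seeded spills": 0.97,
    "open critical red-team findings": 0.0,
}
go, findings = evaluate(checks, measured)
print("GO" if go else "PAUSE")
for finding in findings:
    print(" -", finding)
```

Treating missing evidence as a failure keeps the gate honest: the workflow pauses at a named boundary with a named owner instead of waiting in an open-ended queue.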
That is the doctrine: no categorical non-adoption default, no unmanaged adoption, and managed adoption with Security as the accountable control owner.
The National Security Model
There is no perfect public equation for national security. If someone tells you they can precisely measure national security in one number, be skeptical. This decision does not require a perfect number. It requires a defensible way to compare a labor-productivity gain against an AI-incremental risk term.
The project uses a citable defense production-function model from Jose L. Torres’ 2020 PLOS ONE paper, “The production of national defense and the macroeconomy.” It models defense output as a function of capital and labor:
$$D = B\left[\mu K^{\rho} + (1-\mu)\,L^{\rho}\right]^{1/\rho}, \qquad \rho = \frac{\sigma - 1}{\sigma}$$
In this notation, K is capital, L is labor, B is a scale (productivity) term, mu weights capital against labor, and sigma is the elasticity of substitution between them. The important part is not the Greek letters. The important part is that the model decomposes national-security output into labor and capital. AI is a labor-productivity shock, so AI enters as effective labor:
$$L_{AI} = L\,(1 + g\,\alpha\,e)$$
Here, g is the productivity gain on tasks AI can help with, alpha is the share of cleared knowledge labor exposed to AI-augmentable tasks, and e is realized adoption, not mere access.
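A minimal Python sketch of the two equations, useful for checking intuition about how an effective-labor shock flows through to output. The values of B, mu, sigma, K, and L are hypothetical placeholders rather than Torres' calibration, and e = 0.7 is an assumed adoption level, so the printed percentages are illustrative only:

```python
# Hedged sketch: CES defense production function with AI entering as effective labor.
# B, mu, sigma, K, and L defaults are hypothetical placeholders, not a calibration.
def ces_output(K: float, L: float, B: float = 1.0, mu: float = 0.5, sigma: float = 0.8) -> float:
    """D = B * [mu*K^rho + (1-mu)*L^rho]^(1/rho), with rho = (sigma - 1) / sigma.

    Requires sigma != 1 (rho = 0 is the Cobb-Douglas limit, not handled here).
    """
    rho = (sigma - 1.0) / sigma
    return B * (mu * K**rho + (1.0 - mu) * L**rho) ** (1.0 / rho)

def effective_labor(L: float, g: float, alpha: float, e: float) -> float:
    """L_AI = L * (1 + g * alpha * e): uplift g on exposed share alpha, scaled by adoption e."""
    return L * (1.0 + g * alpha * e)

def output_gain_pct(g: float, alpha: float, e: float, K: float = 1.0, L: float = 1.0) -> float:
    """Percent change in D when L is replaced by L_AI, holding capital fixed."""
    base = ces_output(K, L)
    return 100.0 * (ces_output(K, effective_labor(L, g, alpha, e)) / base - 1.0)

# Uplifts g from the text; alpha = 0.57 from the text; e = 0.7 is an assumption.
for label, g in [("Low", 0.10), ("Medium", 0.25), ("High", 0.40)]:
    print(f"{label:>6}: {output_gain_pct(g, alpha=0.57, e=0.7):.2f}% output gain")
```

Because labor is only one input and the CES form has constant returns to scale, the output gain is smaller than the effective-labor gain; how much smaller depends on mu and sigma.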
The e term matters. Access is not adoption. A chatbot in a separate browser window is not the same thing as an assistant embedded in the analyst tool, the IDE, the document system, the acquisition workflow, or the security-processing queue. And a model that cannot retrieve approved mission data is not the same thing as RAG inside an accredited enclave.
This deserves a footstomp, because it is also why the math assumes any adoption path keeps its AI models current. An AI model has a knowledge cutoff; it knows nothing after that date. People adapt to a changing world faster than a static model does. If the nation is at war tomorrow, the people will know, but a stale model needs constant reminders. If peace breaks out the day after, the people will know, but the stale model still infers war. The world changes, policy changes, regulations change, laws change. Current models plus RAG over approved corpora are what give an AI current, pertinent information.
The Low, Medium, and High adoption cases are work modalities, not arbitrary labels.
Commercial analogies help:
Low adoption looks like a ChatGPT-style side chat. In government, that maps to a standalone assistant for unclassified or tightly bounded CUI tasks, with copy/paste risk, weak provenance, limited mission data, and limited workflow integration.
Medium adoption looks like GitHub Copilot, Microsoft 365-style copilot, Claude Code/Cowork, OpenAI Codex, etc. In government, that maps to an inline assistant in coding, analytic, document, acquisition, legal, contracting, or security-review tools, grounded on approved corpora. The government-specific burden is need-to-know, classification markings, source lineage, no external egress, audit logs, DLP, and accreditation. Much of this already exists on the high side; it just needs integration with the AI tools.
High adoption looks like tool-using agents in software, ticketing, data, or workflow systems. In government, that maps to human-supervised orchestration over mission tools, with least privilege, sandboxed execution, command logging, human approval for high-impact actions, and narrow authority boundaries.
That matters for the recommendation. This is not an argument that analysts should become spectators while autonomous agents run the Intelligence Community. The recommendation is more disciplined: field assistants plus RAG inside accredited enclaves, then gate agentic workflows until the enclave, workflow instrumentation, and controls are mature.
The Productivity Term
The cleared knowledge workforce modeled here is about 1.85 million people: Intelligence Community (IC) and Department of War (DoW) government knowledge workers plus cleared defense and intelligence contractors. The model estimates the AI-exposed share at about 57%, or roughly 1.06 million AI-augmentable cleared knowledge workers.
The model then applies conservative task-level productivity assumptions. Low adoption uses a 10% uplift, below the larger writing, customer-support, and coding RCT effects because standalone chat is intermittent and outside the workflow. Medium adoption uses a 25% uplift, aligned with peer-reviewed 2026 evidence around integrated copilot/RAG-style work, including software and consulting tasks, but then discounted by exposure and realized adoption. High adoption uses a 40% uplift, below favorable agentic-workflow results and treated as phase two, not immediate whole-job autonomy.
Those are not the largest numbers in the literature. Noy and Zhang report large gains for professional writing. Brynjolfsson, Li, and Raymond find material gains in customer support, especially for less experienced workers. Peng et al. report large coding-task speedups in an early Copilot experiment. Cui, Demirer, Peng, and coauthors report about a 26% completed-task gain across three software-development RCTs in peer-reviewed 2026 work. Dell’Acqua et al. report strong in-frontier gains for consultants and worse performance off the frontier. Ju and Aral report favorable agentic-teamwork gains, while the Remote Labor Index suggests fully autonomous whole-job completion remains small. METR’s open-source developer result is a warning that AI can slow experts on complex, mature codebases, and Humlum and Vestergaard’s near-null population-level result is a reason the model discounts capability numbers by realized adoption instead of taking them at face value.
That is why the model does not assume every cleared worker instantly becomes 40% more productive. It discounts by exposed labor share and realized adoption. It also holds AI capability fixed through 2035 in the central time-series chart, even though the capability frontier has already measurably improved since this project started. This project is asking how much effective labor increases once AI is used on the work it can actually help with.
Under Medium adoption, the 2035 result is equivalent to adding roughly 185,000 effective cleared workers. Even a rough version is strategically large. The national-security enterprise argues constantly about recruiting, clearances, retention, billets, contractor ceilings, analytic coverage, cyber backlogs, software queues, acquisition speed, and security-processing timelines. Leaving a six-figure equivalent mission-capacity gain on the table is not a neutral act. It is an opportunity cost.
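The worker-equivalent figure can be sanity-checked with the effective-labor term L * g * alpha * e. This Python sketch uses the workforce, exposure share, and Medium uplift from the text; the realized-adoption value of 0.70 is a hypothetical assumption, not a number the project states:

```python
# Hedged arithmetic check on the workforce figures quoted in the text.
WORKFORCE = 1_850_000   # cleared knowledge workforce (from the text)
EXPOSED_SHARE = 0.57    # alpha, AI-exposed share (from the text, rounded)
G_MEDIUM = 0.25         # Medium task-level uplift g (from the text)
E_MEDIUM = 0.70         # hypothetical realized adoption e by 2035

exposed_workers = WORKFORCE * EXPOSED_SHARE
added_effective = WORKFORCE * G_MEDIUM * EXPOSED_SHARE * E_MEDIUM  # L * g * alpha * e

print(f"exposed workers: {exposed_workers:,.0f}")
print(f"added effective workers: {added_effective:,.0f}")
```

Under that assumption the result is about 184,500 added effective workers, close to the roughly 185,000 quoted. The exposed-worker count comes out near 1.05 million; the text's 1.06 million presumably reflects an unrounded exposure share slightly above 57%.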
The Fears Are Real
A serious AI adoption argument has to steelman the security fears. This project uses ten categories and maps them onto the government’s own EO 13526 damage vocabulary.
The “some damage” concerns are CUI mishandling, privacy or civil-liberties harm, and loss of provenance, auditability, or classification lineage. The “serious damage” concerns are data spillage, cross-domain contamination, hallucination corrupting analysis, poisoning, prompt injection, automation bias, and tradecraft erosion. The “exceptionally grave damage” concerns are aggregation or mosaic inference, AI-amplified insider threat, model or data exfiltration, and supply-chain or foreign-model compromise.
This is the right list to worry about. It is also the wrong list to use as a categorical stop sign unless the counterfactual questions are answered:
How much of each risk is caused by managed AI adoption?
How much already exists?
How much gets worse under unmanaged adoption?
How much gets better because AI improves monitoring, classification support, provenance, anomaly detection, review consistency, and defensive triage?
That is where the “delay while we pilot AI” case weakens.
The counterfactual model estimates three worlds (status-quo delay, managed accredited adoption, and unmanaged adoption) using 60,000 Monte Carlo draws.
One disclosure for skeptics, because it changes how these numbers should be read: the event probabilities and losses behind this chart are a synthetic expert panel: four independent AI models, each answering as six security personas, pooled into twenty-four elicitations. The four models disagree on absolute magnitudes but agree on the ordering: managed below status quo, unmanaged far above both. The pooled panel is an explicitly labeled prior, not a substitute for cleared human expert judgment, and the falsifier section below says exactly how human elicitation should update the model.
The managed result is not “AI is risk-free.” At the 95th percentile, managed AI-incremental loss is about +0.85% of annual national-security output. At CVaR99 (the average loss in the worst 1% of modeled outcomes), it is about +1.91%. That is real. But it is smaller than the Medium productivity benefit to national security.
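For readers who want the mechanics behind “95th percentile” and “CVaR99”, here is a Python sketch of the method, not the project's model. It draws paired status-quo and managed losses from the same baseline (common random numbers), so each difference isolates the counterfactual AI-incremental term, then reads off the tail. Every distribution and parameter is a hypothetical placeholder, so the outputs will not match the figures above:

```python
# Hedged sketch: tail metrics for a counterfactual loss difference via Monte Carlo.
# All distributions and parameters are hypothetical placeholders.
import math
import random
import statistics

def simulate_incremental_losses(n: int = 60_000, seed: int = 2026) -> list[float]:
    """Paired draws: the same baseline feeds both worlds, so each difference is counterfactual."""
    rng = random.Random(seed)  # fixed seed: reruns reproduce the same numbers
    diffs = []
    for _ in range(n):
        baseline = rng.lognormvariate(0.0, 0.5)   # status-quo loss (% of output), hypothetical
        defensive = rng.uniform(0.0, 0.15)        # share of baseline averted by AI defenses, hypothetical
        ai_added = rng.lognormvariate(-2.0, 1.0)  # loss AI causes or worsens, hypothetical
        managed = baseline * (1.0 - defensive) + ai_added
        diffs.append(managed - baseline)          # AI-incremental loss for this draw
    return diffs

def percentile(values: list[float], q: float) -> float:
    """Nearest-rank empirical quantile for 0 < q < 1."""
    ordered = sorted(values)
    k = math.ceil(q * len(ordered)) - 1
    return ordered[max(0, min(len(ordered) - 1, k))]

def cvar(values: list[float], q: float) -> float:
    """Conditional value at risk: mean of the worst (1 - q) share of outcomes."""
    ordered = sorted(values)
    k = max(1, round((1.0 - q) * len(ordered)))
    return statistics.fmean(ordered[-k:])

diffs = simulate_incremental_losses()
print(f"mean {statistics.fmean(diffs):+.2f}  "
      f"P95 {percentile(diffs, 0.95):+.2f}  CVaR99 {cvar(diffs, 0.99):+.2f}")
```

Pairing the draws matters: sampling the two worlds independently would bury the AI-incremental term under baseline noise.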
If the goal is to improve national security, the burden of proof on a delay argument is to show that the specific risks it averts would reduce national security by more than the cost-drag-adjusted productivity benefit.
AI Risk Can Be Nonlinear
One objection to the phrase “AI-incremental risk” is that AI may not add a small linear increment. It could create nonlinear or compounding risk. That objection is legitimate.
AI can make aggregation risk worse because retrieval, summarization, and synthesis make it easier to combine individually innocuous facts. AI can amplify insider threat because one person can query, summarize, transform, and stage more information faster. AI can accelerate data exfiltration if egress controls fail. AI can create correlated analytic error if many offices rely on the same flawed model output. AI can scale prompt injection, poisoning, and automation bias in ways that do not look like ordinary software risk.
The word “incremental” is not meant to deny any of that. It means counterfactual. The policy question is not whether AI can worsen a risk category. It can. The policy question is how much managed AI changes that risk compared with delay and compared with unmanaged adoption.
That distinction is exactly why managed adoption matters. No egress, approved corpora, retrieval provenance, least privilege, user and entity behavior analytics, DLP, red-team testing, model provenance, human verification, and classification-aware review are not cosmetic controls. They are the tools that keep nonlinear risk from becoming uncontrolled risk. If those controls fail measurable tests, the workflow should pause. If they pass, indefinite delay needs its own quantified justification.
“Delay While We Pilot” Becomes Unmanaged Adoption
The strongest form of the non-adoption argument assumes that having no official AI beyond pilot programs means having no AI risk. That assumption does not survive contact with human incentives.
The official system can say no, but the surrounding ecosystem still contains public AI systems, home computers, contractor-side tooling, low-side summaries, personal productivity accounts, and colleagues who can access stronger models. Even ordinary Google search now returns AI-generated answers by default, so much of the workforce already uses AI through unmanaged paths every day without ever deciding to adopt it. If the Information Assurance arm of Security then restricts access to Google, the workforce faces a choice between being trapped in a previous decade and doing work on unmanaged personal devices.
This is not a moral claim about users. It is an institutional claim about incentives. If the sanctioned path is too slow, too weak, or unavailable where the work happens, some work will route around the system. Some of that will be careful. Some will not. The organization will see less telemetry, not more.
That is the worst of both worlds: much of the risk and far less of the productivity gain. The model sees that branch clearly. Unmanaged adoption is negative. A categorical delay posture does not guarantee no AI. It can produce the worst AI.
Delay also undercounts the defensive side. AI can make some baseline security problems worse, but it can also make some baseline security problems easier to detect, measure, and control. AI-assisted user and entity behavior analytics can detect anomalous bulk access faster. Classification-aware tools can catch marking mistakes and spillage attempts. Provenance systems can make derivative outputs easier to audit. Retrieval-grounded assistants can force source display and citation checks. Models can serve as tireless reviewers for missing caveats, inconsistent markings, and analytic hygiene.
None of that happens automatically. It happens only if Security owns the control package instead of keeping the capability outside the system.
Security’s Own Scale Cuts Both Ways
Security has a powerful vocabulary for harm. Confidential means unauthorized disclosure could cause damage. Secret means serious damage. Top Secret means exceptionally grave damage. That vocabulary is usually pointed at disclosure risk, but this project uses it as a magnitude yardstick for both sides of the scale.
This is not a legal reclassification claim. EO 13526 does not assign percentages to its tiers. This is a conditional burden-of-proof argument: if feared AI incidents are being scored on a national-security damage scale, then the national-security output lost to AI restraint should be scored on the same scale for an apples-to-apples comparison.
By 2035, every less-than-full-rapid adoption path imposes an annual opportunity cost in the model’s exceptionally-grave magnitude band. Medium adoption versus full rapid adoption forgoes about 5.4% per year. A conservative rollout forgoes about 6.6% per year. Low adoption forgoes about 9.2% per year. No adoption forgoes about 10.6% per year.
Again, this is not a literal classification determination. It is a symmetry test. If a low-probability AI incident deserves grave language because it could damage national security, then a recurring, near-certain productivity loss of 5% to 10% of annual national-security output also deserves grave language. Security policy should not use the damage scale only on the risk side of the ledger.
The China Argument Cuts Both Ways
China is a real adversary, and not just a stalking horse in national security debates. China might compromise a model, exploit AI spillage, poison data, use foreign-origin models as collection surfaces, or accelerate cyber and intelligence operations with AI. Those are valid threat paths.
But China also benefits if the United States makes its cleared workforce slower. China benefits if American analysts, engineers, operators, acquisition officers, program managers, and security processors are kept off frontier-grade tools while Chinese operators, companies, and state organs adopt aggressively under their own risk calculus.
“Delay AI while we pilot” is not a shield against China. It is a tax collected on ourselves. “Behind” is not the same as “blocked.” China does not need perfect AI to use AI. It needs enough capability, enough urgency, and enough institutional permission to keep moving. The United States national-security enterprise has capability. The control question is whether it can give itself permission in a way that is secure enough and fast enough.
Recommended Doctrine
The doctrine should be:
no categorical non-adoption default;
no unmanaged adoption;
managed accredited adoption with Security as the accountable control owner.
That changes Security’s mission from “prevent AI use” to “field the controls that make AI usable at scale.”
The control missions are concrete.
The implementation order should be boring and fast:
Reject categorical non-adoption as the default.
Prohibit unmanaged adoption by giving the workforce a sanctioned path.
Field a government-hosted CUI enclave first. The Department of War and much of the Intelligence Community are effectively here.
Field classified RAG enclaves with approved corpora and no external egress. That means the AI can access the classified intelligence for which the user has need-to-know and authorization.
Treat agentic classified workflows as phase two, gated by least privilege, command logging, sandboxed execution, and human approval.
Make accreditation evidence-producing. Authority to Operate should not mean queue time. It should mean measured controls.
Require a quantified burden of proof for further delay.
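The need-to-know requirement in the classified RAG step implies filtering the corpus before retrieval rather than scrubbing outputs after generation. Below is a simplified Python sketch of that idea using a lattice-style dominance check (clearance at or above the document's level, plus every required compartment), in the spirit of Bell-LaPadula “no read up”. The levels, compartment labels, and data structures are hypothetical; a real enclave would rely on its accredited access-control infrastructure.

```python
# Hedged sketch: restrict a RAG corpus to documents the user is authorized to see
# BEFORE ranking. Levels, compartments, and structures are simplified and hypothetical.
from dataclasses import dataclass

LEVELS = {"UNCLASSIFIED": 0, "CONFIDENTIAL": 1, "SECRET": 2, "TOP SECRET": 3}

@dataclass(frozen=True)
class Document:
    doc_id: str
    level: str
    compartments: frozenset[str] = frozenset()

@dataclass(frozen=True)
class User:
    clearance: str
    compartments: frozenset[str] = frozenset()  # stand-in for need-to-know

def authorized(user: User, doc: Document) -> bool:
    """Dominance check: clearance at or above the document level and every compartment held."""
    return LEVELS[user.clearance] >= LEVELS[doc.level] and doc.compartments <= user.compartments

def retrievable(user: User, corpus: list[Document]) -> list[str]:
    """IDs of the documents the retriever may search on this user's behalf."""
    return [d.doc_id for d in corpus if authorized(user, d)]

corpus = [
    Document("a", "SECRET"),
    Document("b", "TOP SECRET", frozenset({"X"})),
    Document("c", "TOP SECRET"),
    Document("d", "CONFIDENTIAL", frozenset({"Y"})),
]
analyst = User("TOP SECRET", frozenset({"X"}))
print(retrievable(analyst, corpus))  # prints ['a', 'b', 'c']
```

Filtering before ranking means unauthorized documents never enter the model's context for that user, which is easier to test and audit than trying to catch leaks in generated text.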
A security office should be able to delay a workflow only by showing one of four things:
the workflow cannot meet the required controls, where the controls themselves have a demonstrated, measured national-security benefit;
managed AI-incremental risk exceeds the cost-drag-adjusted productivity benefit;
specific threat intelligence changes the supply-chain, insider, or exfiltration risk enough to cross the break-even line; or
the proposed use is actually unmanaged or agentic beyond approved authority.
“We are concerned about China” is not a risk determination. It is the beginning of one.
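The four grounds for delay can be written as an explicit decision rule, which is one way to make a delay request auditable. A hypothetical Python sketch; the field names are invented for illustration, and the numeric inputs are assumed to be on the same percent-of-annual-output scale:

```python
# Hedged sketch: the four permissible grounds for delaying a workflow as a decision rule.
# Field names and example numbers are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class DelayRequest:
    fails_required_controls: bool        # ground 1: workflow cannot meet required controls...
    controls_benefit_measured: bool      # ...and those controls have a measured benefit
    managed_incremental_risk: float      # ground 2 input, % of annual output
    adjusted_productivity_benefit: float # cost-drag-adjusted benefit, % of annual output
    threat_intel_risk_shift: float       # ground 3: added risk from specific threat intelligence
    unmanaged_or_beyond_authority: bool  # ground 4

def delay_grounds(r: DelayRequest) -> list[str]:
    """Return every ground that justifies delay; an empty list means proceed."""
    grounds = []
    if r.fails_required_controls and r.controls_benefit_measured:
        grounds.append("cannot meet required controls with measured benefit")
    if r.managed_incremental_risk > r.adjusted_productivity_benefit:
        grounds.append("managed AI-incremental risk exceeds adjusted benefit")
    shifted_risk = r.managed_incremental_risk + r.threat_intel_risk_shift
    if r.threat_intel_risk_shift > 0 and shifted_risk > r.adjusted_productivity_benefit:
        grounds.append("threat intelligence moves risk past break-even")
    if r.unmanaged_or_beyond_authority:
        grounds.append("use is unmanaged or agentic beyond approved authority")
    return grounds

# Hypothetical requests; the numbers are illustrative, not model outputs.
routine = DelayRequest(False, True, 0.9, 4.0, 0.0, False)
new_intel = DelayRequest(False, True, 0.9, 4.0, 3.5, False)
print(delay_grounds(routine))    # prints []
print(delay_grounds(new_intel))  # prints ['threat intelligence moves risk past break-even']
```

Returning every applicable ground, rather than stopping at the first, gives the control owner the full list of what would have to change before the workflow can move.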
What Would Change The Model
This should not be an argument that lives forever in abstraction. The model should change when measured evidence changes.
None of this has to be argued abstractly. It can be measured. That is the point.
The Mental Model
AI adoption is not automatically safe. Unmanaged AI has real risks. Managed AI is the highest-security policy in the model.
Default delay does not remove most baseline threats. Default delay does remove the productivity gain, and that gain is strategically large. The mission is not to block AI. The mission is to field the controls that let cleared personnel use AI faster than adversaries can exploit either our data or our hesitation.
That is the job Security should want.
Citations
Main public sources behind the model:
Jose L. Torres, “The production of national defense and the macroeconomy,” PLOS ONE, 2020: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0240299
Correlates of War National Material Capabilities data: https://correlatesofwar.org/data-sets/national-material-capabilities/
OpenAI GDPval: https://openai.com/index/gdpval/
Noy and Zhang, “Experimental evidence on the productivity effects of generative artificial intelligence,” Science, 2023: https://doi.org/10.1126/science.adh2586
Brynjolfsson, Li, and Raymond, “Generative AI at Work,” NBER, 2023: https://www.nber.org/papers/w31161
Cui, Demirer, Jaffe, Musolff, Peng, and Salz, “The Effects of Generative AI on High-Skilled Work,” Management Science, 2026: https://doi.org/10.1287/mnsc.2025.00535
Dell’Acqua et al., “Navigating the Jagged Technological Frontier,” Organization Science, 2026: https://doi.org/10.1287/orsc.2025.21838
Peng et al., “The Impact of AI on Developer Productivity,” arXiv, 2023: https://arxiv.org/abs/2302.06590
Ju and Aral, “Collaborating with AI Agents,” arXiv (revised February 2026): https://arxiv.org/abs/2503.18238
METR, “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity,” arXiv, 2025, with METR’s February 2026 experiment update at metr.org: https://arxiv.org/abs/2507.09089
Humlum and Vestergaard, “Large Language Models, Small Labor Market Effects,” NBER Working Paper 33777 (revised 2026): https://www.nber.org/papers/w33777
Remote Labor Index, arXiv, 2025: https://arxiv.org/abs/2510.26787
EO 13526 and ISOO FAQ: https://www.archives.gov/isoo/faqs/e-o-13526-and-32-cfr-part-2001
NIST AI Risk Management Framework 1.0: https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-ai-rmf-10
NIST Generative AI Profile, NIST AI 600-1: https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence
NSA, CISA, FBI, and partners, “Deploying AI Systems Securely,” 2024: https://media.defense.gov/2024/apr/15/2003439257/-1/-1/0/csi-deploying-ai-systems-securely.pdf
ODNI 2026 Annual Threat Assessment: https://www.dni.gov/files/ODNI/documents/assessments/ATA-2026-Unclassified-Report.pdf
Methodology note: every model output in this piece regenerates deterministically from the project’s code and data (fixed random seeds; figures render directly from model outputs with no hand-entered numbers). The risk priors are a labeled synthetic elicitation pending cleared human expert elicitation and classified workflow pilots, for which prepared instruments exist.










