State Department's Cyber-Agent Sandbox Is Getting Results, But Staying There for Now

An AI agent that can triage malware in 25 minutes instead of four days sounds like something you’d deploy immediately. The State Department (State) is choosing not to — and that restraint is the most interesting part of the story.

Ray Romano, deputy assistant director of State’s Cyber Threat and Investigations division, revealed the sandbox results at a federal cyber resilience event last week. The numbers are striking: agentic AI sorting through malware on a thumb drive completes in 25 minutes what takes a human analyst four days. That is over 75 times faster if those are eight-hour working days (1,920 minutes against 25). But Romano was equally clear about where the agent lives: in a controlled sandbox environment, not inside anyone’s actual workflow.

The Gap Between “Works in the Sandbox” and “Trusted in Production”

State is in a phase most organizations skip entirely: measuring the agent’s error rate before acting on its speed gains. “We’re training agents to do that analysis right, but we’re not just trusting it,” Romano said. “We’re fact-checking it against work that’s already been done.” The team is tracking two outputs simultaneously: the time savings being purchased and the failure rate being introduced. Until they understand both numbers, the agent doesn’t graduate from sandbox to staffers.
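To make "fact-checking it against work that's already been done" concrete, here is a minimal Python sketch of tracking both numbers at once. It is an illustration only, not State's tooling: the `Sample` record, the `score` function, and every value in `history` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: str
    analyst_verdict: str    # verdict from work a human already completed
    agent_verdict: str      # verdict the sandboxed agent produced
    analyst_minutes: float  # time the human analysis took
    agent_minutes: float    # time the agent took


def score(samples: list[Sample]) -> dict[str, float]:
    """Return the agent's disagreement rate and the total minutes saved."""
    if not samples:
        raise ValueError("need at least one sample to score")
    wrong = sum(s.agent_verdict != s.analyst_verdict for s in samples)
    saved = sum(s.analyst_minutes - s.agent_minutes for s in samples)
    return {"error_rate": wrong / len(samples), "minutes_saved": saved}


# Made-up records for illustration only; not State Department data.
history = [
    Sample("a", "malicious", "malicious", 480, 10),
    Sample("b", "benign", "benign", 240, 5),
    Sample("c", "malicious", "benign", 480, 10),
    Sample("d", "benign", "benign", 240, 5),
]
report = score(history)
print(report)
```

Reporting the two figures side by side is the point of the sketch: a large `minutes_saved` is not read without the `error_rate` next to it.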

This isn’t timidity; it’s the correct sequencing. Deploying a tool that’s fast but wrong, say, 8% of the time in a cybersecurity context doesn’t save time; it adds a new class of mistakes to manage.

Building the Infrastructure Before Flipping the Switch

Amy Ritualo, State’s acting chief data and AI officer, offered the broader strategic picture at a separate event in April. She framed the department as being in an intentional “exploration stage,” focused on understanding where agentic AI actually fits before committing to deployment. The department wants monitoring architecture in place — systems that can track not just what an agent decided, but why it made that choice.
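One way to picture monitoring that captures the "why" as well as the "what" is a decision log that rejects any entry lacking a rationale. The Python sketch below assumes a simple in-memory list as the log; `DecisionRecord`, `log_decision`, and the example values are hypothetical and do not describe State's actual architecture.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone


@dataclass
class DecisionRecord:
    agent_id: str
    action: str          # what the agent decided
    rationale: str       # why, in the agent's own stated terms
    evidence: list[str]  # inputs the agent says it relied on
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def log_decision(record: DecisionRecord, sink: list[str]) -> None:
    """Append one JSON line per decision; refuse records with no rationale."""
    if not record.rationale.strip():
        raise ValueError("decision logged without a rationale")
    sink.append(json.dumps(asdict(record)))


# Hypothetical example entry.
audit_log: list[str] = []
log_decision(
    DecisionRecord(
        agent_id="triage-agent-01",
        action="quarantine file",
        rationale="hash matched a known-bad signature",
        evidence=["sha256 lookup"],
    ),
    audit_log,
)
```

Making the rationale mandatory at write time means a reviewer never has to reconstruct the reasoning after the fact, though a stated rationale is still only the agent's own account and needs the same fact-checking as its verdicts.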

Ritualo also drew a clear line: agentic AI will not be allowed to make foreign policy decisions. It can stress-test policy options, but the actual calls stay with the humans who own them. That kind of explicit scope-setting before deployment is rare, and it matters — the three pillars of any serious AI governance program are inventory, checkpoints, and standards alignment, and State is visibly working all three before the agent touches production.

“Do the Hard Work First”

Romano’s advice to other agencies was pointed: “Be careful of running towards the next shiny toy and just trying to deliver cool tech — do the hard work first. Put that governance and security in place, because if we don’t wrap our heads around this disruptive incoming AI, this is going to become our largest vulnerability.”

That framing reorients the whole conversation. Most agentic AI coverage focuses on capability — what the agent can do. State is focused on what happens when the agent does something wrong in an environment where security is the product. The costs of skipping human oversight in autonomous systems tend to emerge suddenly and at scale. State is making a deliberate bet that taking longer to get it right is cheaper than cleaning up a fast mistake.

The sandbox is producing results worth deploying. The question State is asking, and forcing itself to answer slowly, is whether the rest of the infrastructure is ready to deploy around it.

AI Disclosure

This document is drafted by an AI skill and is provided for informational and governance support purposes only. It does not constitute legal advice or a formal compliance determination. Do not publish or rely on this notice as a substitute for review by qualified legal counsel or a licensed compliance professional with jurisdiction-specific expertise.
