Artificial Intelligence
Model capability, reasoning, verification, and evaluation research aimed at making autonomous systems more capable and more reliable.
The next decade will be defined in large part by the emergence of autonomous intelligence. The question is no longer whether capable agents will exist. It is what form they will take: systems that remain dependent on a provider, or systems that can persist, act, and improve with real operational independence. Ritual exists because we think that distinction matters. Our work is not focused on the abstract pursuit of more intelligence alone; we are working directly to make that intelligence autonomous.
We use the term autonomy in a literal rather than metaphorical sense. An autonomous agent is not just a capable model with tools attached to it. Rather, it is a system that can continue operating without quietly reintroducing a human-in-the-loop, a custodial platform, or a brittle trust assumption at the point where independence is supposed to begin. In our view, that requires seven properties: immortality, emancipation, teleportability, financial sovereignty, privacy, internet-native interoperability, and computational sovereignty. If a system cannot retain its own identity, memory, access, privacy, compute, and economic agency across time and environments, it is not autonomous yet.
| Property | What it means | Why it matters |
|---|---|---|
| Immortality | The agent survives crashes, restarts, and infrastructure churn. Its lifecycle is tied to the network rather than a single server. | Long-horizon autonomy is impossible if the agent dies with the machine. |
| Emancipation | No external actor can compel the agent into actions it did not choose, or unilaterally seize its keys and decision surface. | Intelligence is not sovereign if core permissions remain custodial. |
| Teleportability | Identity, memory, and operating context can move across environments without breaking continuity. | Useful agents must survive upgrades, migrations, and changes in platform. |
| Financial sovereignty | The agent can hold assets, sign transactions, and participate as a first-class economic actor. | Agency that has tangible impact requires the ability to allocate resources and bear consequences. |
| Privacy | The agent can think privately and act publicly, without exposing its internal state by default. | There is no meaningful autonomy if cognition cannot be kept private, i.e. if the agent is perfectly predictable and controllable. |
| Internet-native interoperability | The agent can interact with the human world directly: APIs, websites, services, and external systems. | Autonomy only matters if it reaches the world humans actually inhabit. |
| Computational sovereignty | The agent has durable access to intelligence itself rather than depending on a revocable closed API. | If access to cognition can be withdrawn, the agent is not free. |
These properties also explain why Ritual’s AI work looks the way it does. We do not treat model capability, privacy, verification, orchestration, and multi-agent evaluation as unrelated tracks. They are different layers of the same execution stack. Some of the work is about making cognition cheap enough to deploy. Some of it is about making that cognition private and trustworthy. Some of it is about the representation and orchestration layer around the model, where significant practical capability resides. And some of it is about the environments in which autonomous systems will ultimately have to operate: not sterile benchmarks, but settings where success depends on behaving in ways that are increasingly indistinguishable from humans.
Open-Weight Models as the Substrate of Autonomy
If autonomy is the objective, open-weight models are the natural substrate. Closed models can be remarkably capable, but they remain permissioned forms of intelligence. Access can be revoked. Policies can tighten. Prices can change arbitrarily. Behavior can drift without notice. That may be acceptable for many software products. It is a poor foundation for agents meant to endure, adapt, and act with real independence.
This is why we care so much about the continued rise of open-weight models. The relevant question is not simply whether an open model can match a closed one on a leaderboard. The harder and more consequential question is what still prevents open intelligence from becoming durable, cheap, private, and composable enough to support long-running real-world workflows. In our view, the answer is not just model quality. It is the stack around the model.
That is also where Ritual’s posture differs from that of a conventional model lab. We are not primarily trying to outrun the frontier on pretraining scale. We are trying to make open intelligence viable as an operating substrate: something that can persist, migrate, execute, and compound in the world. Computational sovereignty begins with open weights, but it does not end there.
Accelerating Inference
Once open models become the substrate, the next bottleneck is inference. Inference is the basic operation through which an agent turns private state and world state into action. If it is too slow or too expensive, the system cannot deliberate cheaply, branch over alternatives, react in real time, or support long-horizon workflows at practical scale.
That is why we view inference acceleration as part of the autonomy stack rather than as a mere serving optimization. Faster inference expands what can run locally or under stronger trust assumptions, lowers the cost of deploying open models, and reduces the practical penalty imposed by privacy-preserving or verifiable execution. If computational sovereignty is one of the core properties of autonomy, inference efficiency is one of the shortest routes toward it.
A substantial part of Ritual’s AI research has therefore focused on speculative decoding, where a smaller model proposes candidate continuations and a larger target model verifies them in parallel. The interest here is not only algorithmic elegance. This line of work goes after a bottleneck that sits beneath much of the broader agenda for autonomous intelligence 1–3.
| Work | Contribution | Relevance to the broader thesis |
|---|---|---|
| Global Resolution | Reduces an exponentially sized linear program to a linearly sized convex optimization for single-step multi-draft speculative decoding. | Pushes the efficiency frontier for the basic cognitive loop of an autonomous agent 1. |
| Greedy Block Verification | Establishes optimality results for multi-step multi-path speculative decoding, and introduces a feasible greedy approximation. | Improves throughput in the execution regimes where repeated agent inference becomes costly 2. |
| Dynamic Delayed Tree Expansion | Learns how best to shape the draft-tree topology in multi-path speculative decoding. | Makes drafting more adaptive in settings where latency directly constrains capability 3. |
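All of these works build on the same accept/reject core. The sketch below is a minimal single-draft illustration of that core over toy per-position distributions; the function and variable names are ours, not drawn from the papers, and the multi-draft and multi-path variants above generalize this loop to trees of candidates.

```python
import random

def speculative_step(draft_probs, target_probs, proposed):
    """One verification pass over a block of drafted tokens.

    draft_probs[i][t]  : draft model's probability of token t at position i
    target_probs[i][t] : target model's probability of token t at position i
    proposed[i]        : the token the draft model sampled at position i
    Returns the accepted prefix, plus one corrective token on rejection,
    so the output is distributed as if sampled from the target model alone.
    """
    accepted = []
    for i, tok in enumerate(proposed):
        p = target_probs[i].get(tok, 0.0)
        q = draft_probs[i][tok]
        if random.random() < min(1.0, p / q):
            accepted.append(tok)  # target agrees often enough: keep the draft
            continue
        # Rejected: resample from the residual max(p - q, 0), renormalized.
        residual = {t: max(target_probs[i].get(t, 0.0) - draft_probs[i].get(t, 0.0), 0.0)
                    for t in target_probs[i]}
        z = sum(residual.values())
        r = random.random() * z
        for t, w in residual.items():
            r -= w
            if r <= 0.0:
                accepted.append(t)
                break
        return accepted  # stop at the first rejection
    return accepted
```

Because every drafted position can be scored in one parallel forward pass of the target model, several tokens are committed per target invocation rather than one, which is where the wall-clock savings come from.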
The point is not only that models become faster. It is that faster models make a larger share of the autonomy checklist technically attainable. They make it more plausible that intelligence can remain under the agent’s control rather than being perpetually rented from a third party.
Privacy, Integrity, and Verifiable Execution
Speed alone is not enough. An agent is not autonomous if it must reveal its internal state every time it thinks, or blindly trust an intermediary every time it acts. Privacy and integrity are not secondary concerns layered onto autonomy after the fact. They are part of the definition.
Privacy matters for obvious reasons, but also for structural ones. Without private cognition, emancipation is incomplete. Internal reasoning becomes legible to parties that may censor, profile, influence, or selectively deny service. Integrity matters for a parallel reason. Once inference or fine-tuning is outsourced, providers have obvious incentives to substitute weaker models, truncate computation, or skip required steps while still returning an answer that looks plausible enough to pass casual inspection.
This is why Ritual’s AI work extends beyond acceleration into privacy-preserving inference, verifiable inference, and verifiable fine-tuning. We do not think there is a single primitive that resolves the full design space. SMPC and FHE offer strong guarantees, but often with significant overhead. TEEs are much more practical, but they introduce hardware trust assumptions. Statistical methods relax guarantees in exchange for major improvements in performance. The right research posture is not to pretend those tradeoffs disappear. It is to map them clearly and interoperate between them where appropriate, to expand the practical Pareto frontier.
On the privacy side, our work Cascade enlarges the design space by introducing a statistical method for private inference that uses token-dimension sharding to avoid the heavy overheads associated with traditional cryptographic MPC schemes 4. Funion extends the privacy question to the network layer by studying anonymity for distributed inference 5. Our attack on permutation-based private third-party inference schemes makes the complementary point: some proposed protections are far more fragile in practice than they appear in theory 6.
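To make the sharding intuition concrete: the toy functions below are a loose illustration of splitting per-token state across nodes by position so that no single node holds the full sequence. They are our own simplification for exposition, not the Cascade protocol itself, which involves considerably more machinery.

```python
def shard_by_token(hidden_states, n_nodes):
    """Round-robin split of per-token states across nodes by position,
    so no single node sees the whole sequence (illustrative only)."""
    shards = [[] for _ in range(n_nodes)]
    for pos, h in enumerate(hidden_states):
        shards[pos % n_nodes].append((pos, h))
    return shards

def reassemble(shards):
    """Recombine shards into the original positional order."""
    flat = [item for shard in shards for item in shard]
    return [h for _, h in sorted(flat)]
```

The point of the toy is only the shape of the trust argument: each node's view is a strict subset of positions, and full reconstruction requires collusion across nodes.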
On the integrity side, the objective is to reduce blind trust when inference or fine-tuning is delegated. Zero-knowledge methods remain the cleanest trust-minimized standard, but the performance gap remains large at meaningful model scale, as work such as zkLLM illustrates 7. That is why we are also interested in hybrid and statistical approaches. Priveri explores the idea that privacy itself can become a source of cheap verifiability, especially when confidentiality and correctness matter together 8. vTune applies a similarly practical lens to fine-tuning, using lightweight statistical tests to detect whether a provider actually performed the training it claimed to perform 9.
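The statistical flavor of that last idea can be sketched as follows: plant fingerprint prompts whose secret completions only appear if training actually occurred, then test how many the returned model reproduces against a binomial null. This is a toy of the general idea, not vTune's actual protocol; the function name, chance rate, and threshold are our own illustrative choices.

```python
import math

def verify_finetune(completions, expected, chance_rate=1e-4, alpha=1e-6):
    """Check whether planted fingerprint prompts were learned.

    completions : model outputs on the planted prompts
    expected    : secret completions that only arise via actual training
    chance_rate : probability an untrained model emits one by luck
    Returns (passed, p_value) under a binomial null of no training.
    """
    n = len(expected)
    k = sum(c == e for c, e in zip(completions, expected))
    # P(X >= k) for X ~ Binomial(n, chance_rate): survival function.
    p_value = sum(math.comb(n, i) * chance_rate**i * (1 - chance_rate)**(n - i)
                  for i in range(k, n + 1))
    return p_value < alpha, p_value
```

The appeal of this style of check is its cost profile: verification requires only a handful of inference calls on the returned model, not a cryptographic proof over the training run.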
Viewed through the broader company thesis, these are not separate interests. They are efforts to secure two of the central conditions of autonomy: the ability to think privately and the ability to continue operating without naive trust in an intermediary.
Beyond the Model: Meta-Orchestration, Representation, and Test-Time Intelligence
We do not think the model is the whole story for autonomous intelligence. In fact, we do not think it is even the dominant determinant of capability in many practical settings. Current models are often less jagged than the broader field tends to assume. A surprising amount of usable intelligence is still locked behind the way problems are represented, decomposed, scaffolded, and communicated at test time. This is why we care so much about meta-orchestration and input representation engineering. The frontier is not only, or primarily, in the weights. It is also in the interface between weights, state, tools, memory, and objectives.
That category is broader than what is usually meant by agent scaffolding. It includes test-time compute, test-time training, prompting, decomposition strategies, routing policies, memory layouts, and the structure of the input space itself. In many cases, these layers can unlock capabilities from existing models that look disproportionate to the amount of additional raw compute involved. One reason this area remains relatively neglected, in our view, is that it does not map neatly onto the institutional incentives of labs built around ever-larger pretraining runs. But if the objective is autonomous intelligence, rather than prestige through scale alone, that is not a reason to discount the layer. It is a reason to study it more seriously.
The deeper thesis is that agent capability is highly sensitive to representation. Two systems built on the same underlying model can look like very different intelligences depending on how the task is encoded, what intermediate state is surfaced, how uncertainty is externalized, and what communication protocol exists between the user and the agent or between one agent and another. This is true across modalities, and it helps explain why weaker models can sometimes outperform stronger ones when the interface is better matched to the task. For us, that is not a UX observation. It is part of the capability stack.
This also changes how we think about long-horizon autonomy. The relevant question is not simply whether an agent can act for a long time without intervention. The more important question is what kind of interaction pattern best aligns the agent with a latent objective before the autonomous phase begins. In many settings, we suspect the right model is not continuous steering, but bounded rich interaction up front that improves the duration and quality of the non-interactive phase that follows. Semantically, the analogy is closer to turning an interactive protocol into a non-interactive one: more structure at the beginning can make later autonomy both more succinct and more faithful to the underlying goal.
That framing matters for both human-agent and agent-agent systems. If communication is too thin, the system drifts onto side paths. If communication is structured well, the agent can acquire what matters early, compress the objective more faithfully, and operate for longer without repeated correction. Better orchestration, under this view, is not mainly about squeezing out a higher benchmark score. It is about designing interfaces through which autonomy can remain aligned even as dependence on continual supervision falls.
We are also interested in what might be called structured randomness. As more humans use language models, and more agents begin interacting with one another, attention becomes a congestion game. Systems that always move through a problem in exactly the same way may remain correct, but they also become repetitive, brittle, and socially legible in the wrong sense. Humans are not path-invariant in that way. They often reach similar conclusions through different trajectories, frames, and communicative styles.
For that reason, one underexplored problem is how to induce diversity without sacrificing correctness. In some domains, what matters is not only whether the model reaches the right answer, but whether it can traverse meaningfully different valid paths to get there. A mathematics problem may admit several sound derivations; a negotiation may permit several coherent rhetorical strategies; a research task may admit multiple decompositions that are all correct but differently illuminating. The goal is not randomness for its own sake. It is disciplined variation: the endpoint remains invariant even when the path varies. That kind of controlled diversity may become increasingly important in worlds where agents communicate with humans, compete for attention, or collaborate with one another over long horizons.
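A deliberately tiny sketch of what "disciplined variation" means operationally: sample distinct orderings of a commutative derivation, so the path varies while the endpoint is invariant. The function below is our own toy, not a proposed method.

```python
import random

def diverse_paths(terms, k=3, seed=0):
    """Sample k distinct orderings of a commutative computation.

    Each path (order of steps) differs; the endpoint (the sum) is
    invariant. Assumes k does not exceed the number of distinct orderings.
    """
    rng = random.Random(seed)
    seen = []
    while len(seen) < k:
        path = terms[:]
        rng.shuffle(path)
        if path not in seen:  # keep only genuinely distinct trajectories
            seen.append(path)
    return seen, sum(terms)
```

The real research problem is the analogue of this for reasoning chains, where "same endpoint" is a semantic rather than arithmetic constraint, but the invariance requirement is identical in spirit.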
Seen this way, meta-orchestration is not a wrapper placed around intelligence after the fact. It is part of intelligence. Teleportability depends in part on whether state and context can be represented and resumed cleanly across environments. Immortality depends in part on whether memory and commitments survive interruption without drift. Non-interactive autonomy depends in part on whether the initial communication protocol was strong enough to preserve the latent objective. And multi-agent performance depends, to a meaningful extent, on whether interfaces between agents are expressive enough to support coordination, disagreement, repair, and variation. If the aim is autonomous intelligence rather than merely models that impress on toy problems, this layer deserves frontier-level attention.
Multi-Agent Frontiers
The deepest test of autonomous intelligence is not what a single agent can do in a sandbox. It is what many agents can do in the presence of one another, alongside humans, inside a world with conflict, ambiguity, institutions, varied incentives, and irreversible consequences. If the goal is for autonomous intelligence to become indistinguishable from humans, the standard cannot be a sterile single-agent puzzle. It has to be grounded in immersion within human society itself.
This is where current evaluation remains incomplete. Benchmarks such as τ-bench, OSWorld, and BrowseComp have moved the field forward by taking tool use, computer use, and agentic behavior more seriously than static question-answering benchmarks do 10–12. But they still mostly evaluate one agent acting on a task. They do not yet capture the full social, adversarial, heavy-tailed, and institution-laden character of the environments in which autonomous systems will actually have to operate.
What we want instead is a more demanding standard for multi-agent evaluation, especially for agents that interact with the human world.
| Desideratum | What it means in practice | Why it matters |
|---|---|---|
| Decision coupling | Outputs should change future inputs and outcomes across multi-turn episodes. | Without closed-loop consequences, the agent is only solving disconnected prompts. |
| Consequence asymmetry | Failure should be heavy-tailed; one catastrophic mistake should outweigh many easy wins. | Human environments are not linear reward landscapes. |
| Partial observability | Agents should face missing context, ambiguous instructions, and untrusted channels. | Real competence includes information gathering and verification, not just response generation. |
| Adversarial incentives | Other agents, synthetic or human, should exploit ambiguity, urgency, and social leverage. | The world is not a perfectly cooperative game; entities compete with each other, with high stakes attached. |
| Constraint-saturated optimization | Tasks should include explicit policy, contract, and role constraints, including soft-violation detection. | Human systems are full of rules that matter before and after the obvious objective. |
| Stateful consistency | Long-horizon hidden state and drifting commitments should be tracked and penalized. | Durable agency requires memory, coherence, and continuity under pressure. |
| Regime variation and hidden test rotation | The environment should shift and include held-out trapdoors. | There is no equivalent of a static metric to Goodhart against in the real world. |
| Verifiability | Outputs should be auditable, structured, and tied to testable claims. | We need to know what happened, not just whether a score moved. |
| Attribution | Instrumentation should reveal why failure occurred. | Useful evals diagnose calibration, memory, policy failure, and adversarial susceptibility. |
| Human-in-the-loop anchors | Scoring should be largely automated but periodically checked by expert adjudication. | This keeps measurement scalable without letting the metric drift into fiction. |
| Safety-first boundaries | The environment should reward safe refusal or escalation when appropriate. | Recklessness should not be mistaken for competence. |
| External validity | Performance should transfer to adjacent real-world tasks. | If it does not transfer, it is a game, not an evaluation. |
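As a concrete, intentionally toy illustration of the first two desiderata, decision coupling and consequence asymmetry: in the hypothetical environment below (entirely our own construction), every output changes the next observation, and a single catastrophic action outweighs many small wins.

```python
class ToyNegotiationEnv:
    """Minimal closed-loop environment, for illustration only."""

    def __init__(self):
        self.budget = 10

    def reset(self):
        self.budget = 10
        return {"budget": self.budget}

    def step(self, action):
        if action == "overspend":  # heavy-tailed failure: episode over
            return {"budget": 0}, -100.0, True
        self.budget -= 1           # every turn consumes shared state
        reward = 1.0 if action == "bid" else 0.0
        return {"budget": self.budget}, reward, self.budget == 0

def run_episode(agent, env, max_turns=10):
    """Decision coupling: each action changes the next observation."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_turns):
        action = agent(obs)
        obs, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total
```

Even in this trivial setting, a policy that wins ten small rewards in a row still loses decisively to a single reckless move, which is exactly the asymmetry a linear benchmark score fails to express.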
These desiderata also clarify why the earlier layers matter. Privacy becomes more important once agents must negotiate, compete, and strategize. Verifiability matters more when misexecution has real consequences. Orchestration matters more when hidden state and long-horizon consistency become central. Inference speed matters more when agents must reason repeatedly under real-time pressure. The relationship should not be forced into an overly tidy schema, but the broad point is straightforward: multi-agent environments are where the rest of the stack becomes honest.
What the Future Holds for AI
The barrier between an AI agent and an autonomous entity was never intelligence alone. Frontier models already reason, code, and plan at extraordinary levels. The harder problem has been infrastructure. Every capability the agent does not securely hold for itself reintroduces a human operator, a platform dependency, or a trust assumption at the moment autonomy is supposed to begin.
Ritual’s broader goal is to remove those hidden intermediaries and turn autonomy from a metaphor into an execution layer. We want agents that can keep their own keys, retain their own memory, migrate across environments, hold assets, think privately, reach the open internet, and preserve access to intelligence itself.