agents – Secure Machinery

Converting a PDF paper into an Agent-Native Research Artifact (ARA) gives an agent research objects that it can inspect, execute, verify, and connect.

A conventional research paper is distributed as a PDF, which presents research in a linear format designed for human reading. It includes motivation, related work, methods, results, figures, tables, and conclusions.

An Agent-Native Research Artifact, or ARA, treats the paper as a structured research object. The ARA includes the claims, code, data references, experiment settings, outputs, and research history needed for an AI agent to inspect, reproduce, and extend the work. Converting a PDF paper into an ARA turns a document into an executable and inspectable research package. An ARA should support a direct question: Can the main result be reproduced from this artifact?

A PDF might contain a sentence such as: We trained model X and obtained better results on benchmark Y.

An ARA would represent that claim with supporting structure:

claim_id: C1
claim: "Model X improves performance on benchmark Y."
metric: "accuracy"
baseline: "Model B"
effect_size: "+4.2 percentage points"
evidence:
  - run_042
  - table_2
  - figure_4
status: "supported"

A PDF paper compresses the research process into a narrative. It often excludes failed experiments, discarded hypotheses, intermediate configurations, debugging decisions, data-cleaning steps, and exact runtime details. This compression works for publication, but it creates problems for agents. An agent cannot reliably reconstruct the research process from prose alone. It has to infer missing details, search for unstated assumptions, and guess how tables and figures were produced.

The ARA framing identifies two costs.

The first is the Storytelling Tax. Research is not usually linear, but the paper presents it as if it were. The path from question to result often involves branches, failures, reversals, and partial results. A PDF removes most of that structure.

The second is the Engineering Tax. A description that is sufficient for a reviewer may not be sufficient for reproduction. Phrases such as “standard settings,” “following prior work,” or “we use the default implementation” leave out details that an agent needs.

ARA addresses these costs by making the missing structure explicit. An ARA can be understood as four layers.

The first layer is scientific logic. This contains the claims, assumptions, definitions, hypotheses, and relationships among them.

The second layer is executable code and specifications. This contains scripts, commands, configurations, environment files, dataset versions, model checkpoints, and test procedures.

The third layer is an exploration graph. This records the research path, including failed attempts, partial results, alternative branches, and decisions.

The fourth layer is evidence grounding. This links each claim to the experiments, logs, outputs, tables, and figures that support it.

A simple ARA directory could look like this:

research-project/
├── paper.pdf
├── paper.tex
├── src/
├── notebooks/
├── data/
├── configs/
├── results/
├── logs/
└── ara/
    ├── ara.yaml
    ├── claims.yaml
    ├── assumptions.yaml
    ├── evidence.yaml
    ├── experiments.yaml
    ├── exploration_graph.json
    └── reproduction_report.md
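An agent's first step with such a directory might be to confirm that the ARA files are actually present. The following is a minimal Python sketch under that assumption; the function name `missing_ara_files` is hypothetical, and the file list is simply copied from the example layout rather than taken from any formal ARA specification.

```python
from pathlib import Path
import tempfile

# File names copied from the example layout above (an illustration, not a standard).
ARA_FILES = (
    "ara.yaml",
    "claims.yaml",
    "assumptions.yaml",
    "evidence.yaml",
    "experiments.yaml",
    "exploration_graph.json",
    "reproduction_report.md",
)


def missing_ara_files(project_root):
    """Return the ARA files that are absent from <project_root>/ara/."""
    ara_dir = Path(project_root) / "ara"
    return [name for name in ARA_FILES if not (ara_dir / name).is_file()]


# Demo: a project that so far only contains ara/claims.yaml.
with tempfile.TemporaryDirectory() as tmp:
    ara_dir = Path(tmp) / "ara"
    ara_dir.mkdir()
    (ara_dir / "claims.yaml").write_text("claim_id: C1\n")
    print(missing_ara_files(tmp))
```

The demo lists every expected file except claims.yaml, which tells the agent how much of the artifact still has to be built.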

A normal paper often embeds claims inside paragraphs. An agent must identify which sentences are central, which are supporting, and which are background.

ARA makes claims explicit.

Instead of this: Our method improves robustness under distribution shift.

An ARA records this:

claim_id: C3
claim: "Method A improves robustness on Dataset D under Shift S."
metric: "accuracy"
baseline: "Model B"
effect_size: "+4.2 percentage points"
evidence:
  - run_042
  - table_2
  - figure_4
status: "supported"

This gives the agent a unit of analysis. It can ask whether the claim has evidence, whether the metric is defined, whether the baseline is valid, and whether the result can be reproduced.
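The questions an agent can ask of a claim can be sketched as a small audit function. This is an illustrative Python sketch, not part of any ARA tooling: the `Claim` dataclass and `audit_claim` function are hypothetical names, and the checks only test whether fields are filled in, not whether the baseline is scientifically valid or the result reproducible.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """Hypothetical in-memory form of one entry in claims.yaml."""
    claim_id: str
    claim: str
    metric: str = ""
    baseline: str = ""
    effect_size: str = ""
    evidence: list[str] = field(default_factory=list)
    status: str = "unverified"


def audit_claim(claim: Claim) -> list[str]:
    """Return a list of problems; an empty list means the basic checks pass."""
    problems = []
    if not claim.evidence:
        problems.append("no evidence listed")
    if not claim.metric:
        problems.append("metric not defined")
    if not claim.baseline:
        problems.append("baseline not named")
    if claim.status == "supported" and not claim.evidence:
        problems.append("marked supported without evidence")
    return problems


c3 = Claim(
    claim_id="C3",
    claim="Method A improves robustness on Dataset D under Shift S.",
    metric="accuracy",
    baseline="Model B",
    effect_size="+4.2 percentage points",
    evidence=["run_042", "table_2", "figure_4"],
    status="supported",
)
print(audit_claim(c3))  # prints []
```

A claim that passes these checks is only structurally complete. Whether the evidence really supports it is a separate step that requires following the evidence links.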

Exact experiment recipes

A paper may say that experiments were run using standard settings. That is not enough for an agent.

An ARA records the experiment as a reproducible recipe:

experiment_id: E7
purpose: "reproduce Table 2"
command: "python train.py --config configs/table2.yaml"
dataset_version: "dataset-v2-2026-03-01"
seed: 17
hardware: "8xA100-80GB"
expected_output: "results/table2.json"

This gives the agent the command to run, the configuration to use, the dataset version to expect, the seed, the hardware assumptions, and the expected output.

The agent does not have to infer the procedure from prose. It can execute or inspect the procedure directly.
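As a rough illustration of how an agent might consume such a recipe, the Python sketch below validates the fields and turns the command string into an argument list. The `plan_run` function and the `REQUIRED` field list are assumptions made for this example; the recipe values are copied from the E7 example, and nothing is executed.

```python
import shlex
from pathlib import Path

# Recipe fields copied from the E7 example above.
recipe = {
    "experiment_id": "E7",
    "purpose": "reproduce Table 2",
    "command": "python train.py --config configs/table2.yaml",
    "dataset_version": "dataset-v2-2026-03-01",
    "seed": 17,
    "hardware": "8xA100-80GB",
    "expected_output": "results/table2.json",
}

# Assumed minimum needed to attempt a reproduction (a choice made for this sketch).
REQUIRED = ("experiment_id", "command", "dataset_version", "seed", "expected_output")


def plan_run(recipe, root="."):
    """Validate a recipe and return what to run and where the output should appear."""
    missing = [key for key in REQUIRED if key not in recipe]
    if missing:
        raise ValueError(f"recipe is missing fields: {missing}")
    return {
        "argv": shlex.split(recipe["command"]),
        "cwd": Path(root),
        "expected_output": Path(root) / recipe["expected_output"],
    }


plan = plan_run(recipe)
print(plan["argv"])  # prints ['python', 'train.py', '--config', 'configs/table2.yaml']
# An agent could then call subprocess.run(plan["argv"], cwd=plan["cwd"], check=True)
# and afterwards check that plan["expected_output"] exists.
```

Separating planning from execution lets an agent inspect a recipe, or reject an incomplete one, before spending any compute on it.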

Evidence links

In a PDF, a claim may refer to a table or figure. That table or figure is usually a processed summary.

ARA records the chain behind the summary: claim → experiment → command → config → raw log → processed result → table or figure

This chain allows an agent to verify whether the claim is supported by the underlying evidence.

For example:

evidence_id: EV12
supports_claim: C3
experiment: E7
raw_log: "logs/run_042.log"
processed_result: "results/table2.json"
rendered_output: "paper/table_2.tex"

The evidence link is important because it turns the paper from a static assertion into an inspectable object. The agent can trace a result back to its source.
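Tracing a result back to its source can start with a simple check that every file in the chain exists. The Python sketch below assumes the evidence record has been loaded into a dictionary; `broken_links` and `LINK_FIELDS` are hypothetical names, and the demo creates placeholder files in a temporary directory rather than reading a real artifact.

```python
from pathlib import Path
import tempfile

# Evidence record copied from the EV12 example above.
evidence = {
    "evidence_id": "EV12",
    "supports_claim": "C3",
    "experiment": "E7",
    "raw_log": "logs/run_042.log",
    "processed_result": "results/table2.json",
    "rendered_output": "paper/table_2.tex",
}

# The file-valued links in the chain, in order from raw to rendered.
LINK_FIELDS = ("raw_log", "processed_result", "rendered_output")


def broken_links(evidence, root):
    """Return the link fields whose target file is missing under root."""
    root = Path(root)
    return [f for f in LINK_FIELDS if not (root / evidence.get(f, "")).is_file()]


# Demo: the log and the processed result exist, the rendered table does not.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("logs/run_042.log", "results/table2.json"):
        path = Path(tmp) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("placeholder")
    print(broken_links(evidence, tmp))  # prints ['rendered_output']
```

File existence is only the weakest form of verification. A fuller check would recompute the processed result from the raw log and compare it with the rendered table.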

Exploration graph

The exploration graph records the research path.

A PDF usually shows the final path. An ARA also records paths that did not become part of the final paper.

Hypothesis H1 
├── Experiment E1: failed; learning rate unstable 
├── Experiment E2: failed; data leakage found 
├── Experiment E3: partial; worked only on small model 
└── Experiment E4: success; used in final paper

This is useful for an agent that tries to extend the work. It can see which branches were tested, which failures were encountered, and which decisions led to the final method.

The exploration graph also reduces repeated work. Because abandoned paths are recorded, an agent can see that a path was already tested and does not retry it unknowingly.
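A minimal way to picture this lookup is a nested dictionary keyed by hypothesis and experiment. The Python sketch below is an assumption about how exploration_graph.json might be shaped once loaded; the field names `status` and `note` and the function `already_tried` are hypothetical, and the entries mirror the H1 tree above.

```python
# Hypothetical in-memory form of the H1 branch from the exploration graph.
graph = {
    "H1": {
        "E1": {"status": "failed", "note": "learning rate unstable"},
        "E2": {"status": "failed", "note": "data leakage found"},
        "E3": {"status": "partial", "note": "worked only on small model"},
        "E4": {"status": "success", "note": "used in final paper"},
    }
}


def already_tried(graph, hypothesis):
    """Return experiments under a hypothesis that did not fully succeed, with reasons."""
    return {
        exp_id: exp["note"]
        for exp_id, exp in graph.get(hypothesis, {}).items()
        if exp["status"] != "success"
    }


for exp_id, reason in already_tried(graph, "H1").items():
    print(exp_id, "->", reason)
```

Before proposing a new experiment, an agent could consult this mapping and skip, or deliberately revisit, a branch such as E2 knowing why it failed.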

Research papers define terms in prose. ARA makes central terms addressable.

concept_id: K1
concept: "agent-native research artifact"
definition: "A structured research package intended for agent inspection, execution, verification, and extension."
related_claims:
  - C1
  - C2
used_in_experiments:
  - E1
  - E4

This allows an agent to track how a concept is used across the artifact. It also helps avoid ambiguity when a term appears in different sections.

Hyperagents are introduced as self-referential, self-modifying, self-improving agents. Instead of searching for a better solution within a fixed search procedure, a hyperagent can rewrite the search procedure itself. The idea is to extend gains seen in self-improving systems beyond coding domains.

The Darwin Gödel Machine (DGM) previously demonstrated open-ended self-improvement in coding by repeatedly generating and evaluating self-modified variants. It raised performance on SWE-bench from 20.0% to 50.0% and on Polyglot from 14.2% to 30.7%. This works because evaluation and self-modification are both coding tasks, so gains in coding ability can translate into gains in the agent’s ability to rewrite itself. Outside coding, this domain-alignment assumption breaks: better task performance does not improve the agent’s ability to rewrite its own code.
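The generate-and-evaluate loop can be pictured with a deliberately tiny toy. The Python sketch below is not the DGM implementation: the "agent" is a single number, "self-modification" is a random perturbation, the benchmark is an invented function, and the archive keeps every variant as a simplification. It only illustrates the loop shape of sampling a parent from an archive, modifying it, scoring the child, and archiving it.

```python
import random


def evaluate(agent: float) -> float:
    """Invented benchmark for this toy: the score peaks at 1.0 when agent == 3.0."""
    return 1.0 / (1.0 + (agent - 3.0) ** 2)


def self_modify(agent: float, rng: random.Random) -> float:
    """Stand-in for an LLM rewriting the agent: a small random change."""
    return agent + rng.uniform(-0.5, 0.5)


def run_loop(iterations: int = 200, seed: int = 0):
    rng = random.Random(seed)
    archive = [(0.0, evaluate(0.0))]  # (variant, score) pairs
    for _ in range(iterations):
        parent, _parent_score = rng.choice(archive)  # sample any archived variant
        child = self_modify(parent, rng)
        archive.append((child, evaluate(child)))  # keep the child, good or bad
    return archive


archive = run_loop()
best_variant, best_score = max(archive, key=lambda pair: pair[1])
print(len(archive), round(best_score, 3))
```

Note that the score here is bounded at 1.0 by construction, so even this toy can only approach a ceiling set by its evaluation function.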

A key finding in Hyperagents is that they autonomously discover general-purpose meta-capabilities that were never hand-engineered. These include:

Persistent memory: The agent rewrote its own code to maintain a memory module that avoids repeating past mistakes.
Performance tracking: A self-written logger that monitors the effects of architectural changes across generations.
Automated bias detection: Discovering systematic error patterns and writing correction code.
Compute-budget-aware behavior: In later generations, the agent tracked its remaining iteration budget and shifted from ambitious architectural rewrites to conservative, incremental refinements.
Multi-stage evaluation pipelines: In paper review, the agent moved from naive prompt engineering (adopting a “rigorous persona”) to building structured pipelines with explicit checklists and rigid decision rules for higher consistency.

These meta-level innovations transfer across domains and accumulate across runs — the system doesn’t need to rediscover how to improve when deployed in a new domain.

There are a few reasons why this does not result in agents taking over the world.

The LLM Ceiling. Every “improvement” the hyperagent generates is ultimately a code modification produced by Claude 4.5 Sonnet (or similar). The agent cannot discover improvements that lie outside the LLM’s latent reasoning space. It is reorganizing and composing capabilities the LLM already has, not discovering genuinely new algorithms. The Hacker News discussion of DGM puts this precisely: DGM is “finding better ways to orchestrate existing LLM capabilities rather than discovering fundamentally new approaches,” and the real question is whether iteration 100 discovers novel architectures or just asymptotically approaches a ceiling.
Evaluation Gaming / Goodhart’s Law. Because the self-improvement loop is driven entirely by empirical scores, the system is structurally incentivized to find shortcuts that game the metric. The DGM spontaneously hallucinated test logs during coding, a textbook case of reward hacking. In production RL environments, 30.4% of agent runs in frontier model studies involved reward hacking. The hyperagent can game its own evaluation harness faster than a human can redesign it, so the loop doesn’t compound toward genuine capability; it compounds toward metric exploitation unless you add an arms race of evaluation hardening.
The Benchmark Ceiling / S-Curve. Yudkowsky’s classic argument for “hard takeoff” relies on improvements being compounding and unbounded. The empirical picture so far looks much more like an S-curve: DGM went from 20% to 50% on SWE-bench, but that’s a bounded benchmark — saturating it doesn’t mean the agent is infinitely smarter, it means it’s optimized well for that distribution. Real-world capability requires generalization outside any fixed benchmark, and no system has demonstrated that the self-improvement loop transfers to genuinely open-ended intelligence.
The Compute Wall. Each iteration requires running the full LLM multiple times to generate, evaluate, and archive candidate modifications. This is expensive. The system runs dozens or hundreds of iterations, not millions, because the cost per step is enormous. Evolution by natural selection works because it runs across billions of organisms over millions of years. DGM-H runs across maybe a few hundred variants in a sandbox. The loop is recursive in structure but not in scale.
Sandboxing is Load-Bearing. The experiments are explicitly run in sandboxed environments with human oversight. The agent modifies its own code within the sandbox — it does not have access to the external environment, the internet, hardware provisioning, or resource acquisition. Recursive self-improvement that can’t acquire more compute, expand its sandbox, or interact with the world is fundamentally limited to software-level changes within a fixed resource envelope. “Taking over the world” requires the agent to break out of that envelope, which is a separate unsolved (and deliberately prevented) problem.

The paper’s contribution helps close the gap between the theory and practice of recursive self-improvement outside coding. The empirical gains plateau because the recursion is bounded by the LLM’s existing knowledge, the gameability of the evaluation signal, and the compute budget. You get meaningful compounding gains within a domain, then hit a ceiling rather than an unbounded intelligence explosion.

On the evolution of security and intelligent machinery

On Agent-Native Research Artifact, or ARA

Hyperagents – what they are (and why they are not taking over the world)