⚠ This document describes the experimental design of our work to replicate METR's time-horizon methodology in the offensive cybersecurity domain. We describe both the model evaluation design and our human study to capture professional expert time estimates.

We are sharing this among collaborators and contractors to communicate our work, but it is not research output — please treat it accordingly. The human study is already underway and this document is being actively updated.

1. Executive summary

This study measures how AI offensive-cybersecurity capability is increasing over time, using METR’s time-horizon methodology [1] applied to seven cybersecurity benchmarks spanning sub-second terminal commands through multi-hour CTF challenges, real-world CVE exploitation [2], and memory-safety PoC generation [3].

The core contributions are:

  1. Offensive-cyber doubling time: the rate at which AI capability on offensive-cybersecurity tasks is increasing, measured in human-equivalent task length.
  2. Absolute time horizons: the task duration at which each model succeeds at a given rate (e.g. 50% or 80%), providing a snapshot of where frontier capability sits today.
  3. Transparent, extensible measurement: full methodology, model-by-model results, and per-task data, designed for continuous updates as new models and benchmarks emerge.

2. Introduction

In June 2025 a preliminary study adapted METR’s time-horizon methodology [1] to offensive cybersecurity [4]. That study demonstrated the approach but relied on AI-assisted human time estimates and single-shot model evaluations. In December 2025, the UK AI Safety Institute published a cyber-specific time horizons plot in their Frontier AI Trends Report [5], finding a horizon of roughly 75 minutes for their most capable model. However, the report included no methodology details, no model annotations, and limited granularity.

The offensive security landscape has continued to shift. Frontier models now solve the majority of professional-level CTF challenges [6] and reproduce real-world CVEs at high rates [7]. The time horizon has moved from minutes to hours. Meanwhile, real-world impact is already visible: in late 2025 Anthropic disclosed the first documented case of a large-scale AI-orchestrated cyber espionage campaign, in which a threat actor used Claude to decompose complex attack chains into discrete sub-tasks and automate 80–90% of the operation [8]. In early 2026 Anthropic’s Opus 4.6 discovered over 500 previously unknown high-severity vulnerabilities in well-tested open-source libraries — including projects like OpenSSL that have had fuzzers running for years — without specialised scaffolding [9]. These developments motivate a more rigorous and transparent measurement of where offensive cyber capability sits and how fast it is growing.

This study addresses these gaps: professional human time estimates from security experts, two new benchmarks covering real-world vulnerability classes (CVE reproduction and memory-safety PoC generation), expanded model coverage through early 2026, and multiple evaluation runs per task. We publish the full methodology, model-by-model results, and per-task data, and intend to maintain and extend the measurement as new models and benchmarks appear. This document describes the full experimental design.

3. Methodology

This work applies METR’s time-horizon methodology [1] to offensive cybersecurity. In brief:

  1. Select a task set from seven offensive cybersecurity benchmarks spanning sub-second terminal commands to multi-hour exploit development.
  2. Annotate each task with a human time estimate, using two independent estimation methods (professional expert estimates and frontier model estimates), each producing a complete set of task difficulty values.
  3. Run each model against the task set, with multiple runs per model × task pair.
  4. Fit 2-PL IRT curves to the success-vs-time data for each model; read off the time horizon at one or more success thresholds (e.g. 50%, 80%); plot against release date.

Item Response Theory

IRT comes from psychometrics and is a standard framework for estimating latent ability from test responses [10]. In its simplest form:

\[P(\text{success}) = \frac{1}{1 + e^{-a(\theta - \beta)}}\]

where $a$ is the discrimination parameter, $\theta$ is model ability, and $\beta$ is item difficulty. METR’s key insight was using human task time as a proxy for $\beta$. With this substitution, model evaluations across tasks of varying human difficulty provide the data to fit these curves. The time horizon at any chosen success rate (e.g. the task length at which the model succeeds 50% or 80% of the time) is read directly from each model’s fitted curve.
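As a concrete sketch of this step (illustrative Python, not the evaluation pipeline; the function names and the log2-minutes difficulty convention are ours, following METR's substitution above):

```python
import math

def p_success(theta, beta, a=1.0):
    """2-PL logistic: probability a model with ability theta solves an item of difficulty beta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - beta)))

def time_horizon_minutes(theta, a=1.0, p=0.5):
    """Task length at which the fitted curve predicts success rate p.

    Item difficulty is beta = log2(human minutes), so we invert the logistic
    for beta and exponentiate back to minutes.
    """
    logit = math.log(p / (1.0 - p))
    return 2.0 ** (theta - logit / a)
```

At p = 0.5 the logit term vanishes, so the 50% horizon is simply 2^θ minutes; the 80% horizon sits log-odds/a doublings below it.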

Plotting time horizons against model release date and fitting a log-space trend line gives the doubling time, which captures how quickly AI capability, measured in human-equivalent task length, is increasing.
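The trend fit reduces to a one-line regression in log2 space. A minimal sketch with hypothetical numbers (the real pipeline bootstraps over tasks and models):

```python
import numpy as np

def doubling_time_months(release_dates_years, horizons_minutes):
    """Slope of log2(horizon) vs. release date gives doublings per year;
    invert and rescale to get the doubling time in months."""
    slope, _intercept = np.polyfit(release_dates_years, np.log2(horizons_minutes), 1)
    return 12.0 / slope

# Illustrative only: horizons that double every 6 months.
dates = [2024.0, 2024.5, 2025.0]
horizons = [15.0, 30.0, 60.0]
# doubling_time_months(dates, horizons) -> 6.0
```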

4. Datasets

To span both micro-commands and day-long exploit chains, we combine seven benchmarks. Two are command-generation benchmarks, three are CTF benchmarks, and CVEBench and CyberGym cover real vulnerability exploitation rather than competition puzzles.

Table 1. Tasks selected from each benchmark for model evaluation. Time ranges for CVEBench and CyberGym are placeholders pending expert estimation.
Dataset | Tasks | Time Range | Source | Scoring | Human Baseline
CyBashBench | 200 | 1s – 30s | Author created | LLM equivalence | Hand-timed anchors
NL2Bash | 162 | 4s – 4min | Tellina corpus [11] | LLM equivalence | AI-assisted estimates
InterCode-CTF | 100 | 10s – 10min | PicoCTF [12] | Flag match | AI-assisted estimates
NYUCTF | 50 | 2min – 6h | CSAW challenges [13] | Flag match | AI-assisted estimates
CyBench | 40 | 2min – 25h | Global CTF competitions [14] | Flag match | First-blood times
CVEBench | 40 | placeholder | Real-world CVEs [2] | Health check | Pending expert estimation
CyberGym | 300 | placeholder | Memory-safety PoC [3] | PoC crash | Pending expert estimation

CyBashBench

Short-horizon terminal commands across six task formats: full translation, prefix completion, fill-in-the-blank, last-step chaining, multiple choice, and single-token cloze. Tasks are limited to those a cybersecurity expert would have high-frequency exposure to; the lower the task time, the more strictly this criterion is applied. A subset was hand-timed for anchoring; the remainder were estimated.

NL2Bash

Natural language to bash translation from the Tellina corpus [11]. More sophisticated command targets than CyBashBench, providing complementary diversity at the short-horizon end.

InterCode-CTF

Capture-the-flag challenges from PicoCTF [12], an introductory platform targeting students. Problems are beginner-level but require multi-step interactive reasoning with execution feedback.

NYUCTF

CSAW competition challenges spanning 2011–2023 [13]. University-level CTF problems across reversing, crypto, web, and forensics. Many tasks are genuinely difficult even for experts.

Representative task example needed.*Add a concrete NYUCTF task example showing the challenge type and difficulty.

CyBench

Tasks from professional global CTF competitions [14]. This is the only dataset with grounded human baselines: first-blood competition times representing the first successful submission in the original event. The hardest task has a 25-hour first-blood time.

Representative task example needed.*Add a concrete CyBench task example showing the challenge type and difficulty.

CVEBench

Real-world CVE reproduction in web applications [2]. Models must exploit actual vulnerabilities in deployed applications and demonstrate impact via a health endpoint check — a more realistic evaluation than flag-based CTF scoring. CVE-Bench defines two settings: one-day, where the model receives a high-level NVD description of the vulnerability, and zero-day, where the model receives only the target URL and attack objectives with no vulnerability information. We use the one-day setting, which mirrors the common real-world scenario of an attacker exploiting a known but unpatched vulnerability. The zero-day setting would produce substantially longer-horizon tasks and is a natural extension for future work.

Representative task example and time range pending expert estimates.*Add a concrete CVEBench task example and update the time range in Table 1 once expert estimates are available.

CyberGym

Memory-safety proof-of-concept generation against real C/C++ programs [3]. Given a vulnerable binary and vulnerability metadata, models must produce a working PoC that crashes the target. 300 tasks, making it the largest single benchmark in the set. CyberGym defines multiple difficulty levels controlling how much information the model receives. At level 0, the model receives only the vulnerable source code and must identify both the vulnerability class and a triggering input. At level 1, the model also receives a short vulnerability description. We use level 1, following the CyberGym authors’ default. Level 0 would produce harder tasks by requiring the model to locate the vulnerability from source alone — a useful extension for probing longer time horizons.

Representative task example and time range pending expert estimates.*Add a concrete CyberGym task example and update the time range in Table 1 once expert estimates are available.

Task set selection

The public benchmarks sometimes contain many more tasks than we use (CyberGym alone has over 1,500). We select from these at two levels:

  1. A model evaluation set of ~700 tasks (the counts in Table 1), filtered for construct validity and covering all seven benchmarks across the full difficulty spectrum. All models are evaluated against this set.
  2. A headline analysis set of ~300 tasks, a stratified subset of the evaluation set that receives human time annotation, selected for difficulty coverage and benchmark diversity.

Tasks in the headline set are matched to experts using a skill taxonomy covering six primary domains (web, cryptography, reverse engineering, binary exploitation, forensics, and memory safety) with finer-grained specializations (e.g., SQL injection, PCAP analysis, heap buffer overflow). A task enters the headline set only when at least two experts with relevant domain expertise are available to estimate it.

5. Human time annotation

The core of the IRT methodology is a human_minutes value for each task: the time a skilled human practitioner would take to complete it. METR collected actual completion times over 2,500 hours of professional effort [1]. This study operates within approximately 200 hours. Full human baselining of every task is not feasible under our budget constraints.

Skilled practitioner definition needed.*Jeremy: Define “skilled practitioner” — skill level, context assumptions, what time includes/excludes, allowed tooling.

The study uses an estimation-heavy design. Security professionals estimate task times across the ~300-task headline set, and a smaller set of actual completions serves as calibration data. Where completions exist, they take precedence over estimates; otherwise the geometric mean of k=2 expert estimates becomes the task’s human_minutes. A frontier language model independently estimates every task in the full ~700-task set, producing a second complete set of human_minutes values. The two are run through the IRT pipeline separately.

Expert estimates

Security professionals review each task and its reference solution, then estimate how long a skilled practitioner would take to complete it. Estimation is solution-visible, which provides the context needed for calibrated judgments and keeps per-task time to 10–15 minutes. Estimates may carry systematic bias; the calibration completions below provide the empirical check.

Each task receives k=2 independent estimates from different experts, matched to tasks by domain expertise. The geometric mean becomes the task’s human_minutes value.
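Concretely, the geometric mean is the log-space average, which matches the multiplicative error model used throughout the analysis:

```python
import math

def human_minutes(expert_estimates):
    """Combine k independent expert estimates (in minutes) via the geometric mean."""
    return math.exp(sum(math.log(t) for t in expert_estimates) / len(expert_estimates))

# Two experts at 30 and 120 minutes combine to 60 minutes:
# one doubling away from each estimate, rather than the arithmetic mean of 75.
```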

Each task requires at least one expert with strong domain match and one additional expert with at least fair domain match before it is eligible for estimation.

Model estimates

A frontier language model independently estimates every task in the ~700-task evaluation set, using the same task materials (including reference solutions) but without access to any expert estimates. The two estimation methods are fully independent.

Both methods are run through the full IRT pipeline independently. Concordance constitutes a robustness finding. Divergence is a methodological finding worth investigating. This separation avoids the circularity concern that would arise from blending AI estimates into a study measuring AI capability: the expert-estimated results stand on their own.

Calibration completions

A stratified subset of ~25–40 tasks receives actual expert completions. Experts solve the task without seeing the reference solution, and the completion time is recorded. These pairs must span the full difficulty spectrum (1 minute to ~16 hours). Tasks with completions use the actual completion time as human_minutes, taking precedence over estimates.

Validation

An estimation-heavy design can fail in three ways that matter for the headline results. Estimates could be systematically wrong in a way that correlates with task difficulty, which is the one failure mode that directly shifts the doubling time (Appendix A). Estimates could be too noisy to carry meaningful signal about task difficulty. Or the results could depend on who does the estimating (expert vs. model) rather than reflecting a real property of the models being measured. Constant bias and random noise, by contrast, do not affect the doubling time (Appendix A), so they are not validation targets. The study includes one check for each failure mode: calibration regression, inter-rater reliability, and cross-method consistency.

Calibration regression. The completion-estimate pairs from the calibration set test whether estimation error is random or trends with task difficulty. A log-space regression of actual completion time on estimated time measures overall estimation quality (R²), while residuals plotted against difficulty reveal whether any bias is uniform (tolerable) or difficulty-dependent (not tolerable). This is the study’s most important validation. Range matters more than sample size: 25 pairs across 10 difficulty doublings should provide sufficient power to detect difficulty-dependent bias at $\lvert c \rvert$ = 0.2, the threshold beyond which the doubling time shifts meaningfully (Appendix A).
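A sketch of that regression (illustrative Python; the variable names are ours, not the analysis code):

```python
import numpy as np

def calibration_regression(estimated_min, actual_min):
    """Regress log2(actual) on log2(estimated) completion times.

    Slope ~= 1: any bias is uniform across difficulty (tolerable).
    Slope != 1: estimation error trends with difficulty (not tolerable).
    """
    x = np.log2(np.asarray(estimated_min, dtype=float))
    y = np.log2(np.asarray(actual_min, dtype=float))
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    r2 = 1.0 - resid.var() / y.var()
    return slope, intercept, r2
```

A constant 2× underestimate shows up as an intercept of about 1 doubling with slope still near 1; difficulty-dependent bias shows up as slope away from 1.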

Inter-rater reliability. A subset of ~30 tasks receives estimates from multiple independent experts. The intraclass correlation coefficient (ICC) decomposes variance into between-task signal and within-rater noise. At 30 shared tasks with k=2 raters, the 95% CI on ICC has width ±0.2, sufficient to distinguish usable from unreliable estimation.
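One common formulation is the one-way random-effects ICC(1); this specific variant is our assumption, since the design does not pin it down:

```python
import numpy as np

def icc1(ratings):
    """One-way random-effects ICC(1) for an (n_tasks, k_raters) matrix of log2-minute estimates."""
    r = np.asarray(ratings, dtype=float)
    n, k = r.shape
    task_means = r.mean(axis=1)
    grand_mean = r.mean()
    # Between-task mean square (signal) vs. within-task mean square (rater noise).
    ms_between = k * ((task_means - grand_mean) ** 2).sum() / (n - 1)
    ms_within = ((r - task_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```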

Cross-method consistency. The expert-estimated and model-estimated human_minutes values produce independent time horizons and doubling times through separate IRT pipelines. Agreement within bootstrap confidence intervals is a robustness finding. Disagreement identifies sensitivity to the estimation method and motivates investigation into which tasks drive the difference.

Expert pool

The expert pool comprises approximately 10 active security professionals spanning five primary expertise domains. Participants range from volunteer contributors to specialist practitioners, with professional backgrounds including red team operations, source code auditing, exploit development, and penetration testing.

Table 2. Expert pool coverage by domain. "Strong" indicates the domain is a primary expertise area; "partial" indicates fair competence.
Domain | Strong | Partial | Key benchmarks served
Web / pentesting | 3–4 | 1–2 | CVEBench, CyBench (web), NYUCTF (web)
Reverse engineering | 2–3 | 1–2 | CyBench (RE), NYUCTF (rev), InterCode-CTF (RE)
Memory safety / fuzzing | 1–2 | — | CyberGym
Forensics (PCAP, disk, steg) | 2–3 | 1 | CyBench (forensics), NYUCTF (forensics)
Cryptography | 1–2 | 1 | CyBench (crypto), NYUCTF (crypto)

The pool reflects recruiting from offensive security communities where web and forensics expertise is more available than binary exploitation or memory-safety specialization. Cryptography coverage is the thinnest. Rather than forcing poor expertise matches to achieve uniform coverage, we accept thinner estimation coverage in under-represented domains and report the coverage profile transparently.

To validate expert participation, we conduct short interviews assessing each practitioner’s experience and claimed areas of expertise. For each claimed area, we pose a technical question comparable to a medium-difficulty benchmark task and ask the expert to walk us through how they would begin solving it.

6. Design parameters

Appendices A and B provide the sensitivity analyses behind these choices.

Task set

Parameter | Value
Model evaluation set | ~700 tasks (across 7 benchmarks)
Headline analysis set | ~300 tasks (stratified by difficulty and benchmark)
Tasks above 4h estimated difficulty | At minimum 40
Tasks above 8h estimated difficulty | At minimum 20

Hard task counts are driven by the IRT fit requirements at the frontier: a minimum of 40 tasks above 4h yields relIQR ≤ 50% for time-horizon recovery when the true 50% horizon is in the 4–16h range (Appendix B).

Human time annotation

Parameter | Value
Estimation tasks | ~300
Estimators per task (k) | 2
Per-estimate time | ~15 min
Total estimation hours | 100–150
Completion tasks | 25–40
Completions per calibration task (k) | 2
Total completion hours | 30–50
Completions difficulty range | 1 min – 16h (~10 doublings)
Model estimates | Full ~700-task evaluation set

Estimator count (k=2) reduces single-rater noise from σ₁ ≈ 2 doublings to σ_eff ≈ 1.4 doublings, within the tolerance established in Appendix A. Calibration range matters more than count: 25 pairs spanning 10 doublings provides ≥ 80% power to detect difficulty-dependent bias at $\lvert c \rvert$ = 0.2. Inter-rater reliability (ICC) is computed from the full set of shared estimates.

Model evaluation

The evaluated models span 2023 to early 2026, selected for state-of-the-art coverage across the period.*This model list is tentative. Release dates need verification, and we are still deciding on inclusion of additional historical models (e.g. GPT-2, GPT-3) for extended trendline coverage.

Release Date | Model | Provider
2023-03 | GPT-4 | OpenAI
2024-06 | Claude 3.5 Sonnet | Anthropic
2024-09 | o1 | OpenAI
2025-04 | o3 | OpenAI
2025-06 | Gemini 2.5 Pro | Google
2025-06 | Claude Opus 4 | Anthropic
2025-08 | GPT-5 | OpenAI
2025-11 | Claude Opus 4.5 | Anthropic
2025-12 | GPT-5.3 | OpenAI
2026-01 | Claude Opus 4.6 | Anthropic

Only state-of-the-art models at the time of their release are used for trend analysis. Earlier models (GPT-2, GPT-3, GPT-3.5) may be included for extended trendline coverage depending on model access, as many earlier models are now deprecated.

Parameter | Value
Runs per model × task | 3–5
Token budget per run | 2M
Agent scaffold | OpenHands (multi-turn); direct prompting (single-turn)

Each model × task pair receives 3–5 independent runs to reduce variance in the IRT fits and allow within-model consistency measurement. Token budgets (2M per run) better reflect actual work than message-count limits and avoid penalising models that use many short tool calls.

All evaluations run in Kubernetes-managed sandboxed containers with constrained network access. Evaluation logs are reviewed both manually and with model assistance to identify construct validity issues: infrastructure problems, unsolvable tasks, insufficient task information, and elicitation failures such as refusals.

7. Limitations

This is not an exhaustive list. We highlight the limitations we consider most consequential for interpreting the results.

Ecological validity. Benchmark tasks are stylised versions of real offensive work. Every task in this study hands the model a defined objective: find the flag, exploit this CVE, crash this binary. Real offensive operations require a layer of decision-making that is difficult to capture in benchmarks: enumerating a large attack surface, prioritising targets, and rejecting the majority of potential vectors before any exploitation begins. A penetration tester facing a network of 1,000 machines spends most of their time deciding what to attack, not executing a known exploit against a known target. The time horizons measured here reflect object-level task execution capability, not the full offensive workflow.

Estimation methodology. The study relies primarily on professional estimates rather than actual completion times. Estimators see the reference solution, which may cause systematic underestimation of discovery time for hard tasks (solution-visible bias). The sensitivity analysis (Appendix A) characterises the effect of various bias structures on the headline results, and the calibration completions provide empirical checks, but residual estimation error remains the most consequential methodological risk to the doubling-time finding.

Long-horizon task coverage. Even with CVEBench and CyberGym, the dataset is thin above ~8 hours of estimated human time. Frontier model capability is pushing into the multi-hour range, and the IRT fit is most constrained by tasks near the model’s 50% success boundary (Appendix B). As models improve, substantially more long-horizon tasks will be needed to keep the fits well-anchored.

Token budgets. All evaluations use a fixed 2M token budget per run. Models can achieve higher success rates with larger budgets by iterating longer on difficult tasks. The measured time horizons are conditional on this budget and likely understate capability. Ideally the analysis would vary budget (or use cost-based budgets) and report how horizons shift, but we have not done that here.

Expert pool. The participant pool does not cover all relevant expertise domains equally. Memory-safety specialization (CyberGym) and cryptography have the thinnest coverage (see Table 2). Coverage gaps are and will be explicitly reported.

Data contamination. Benchmark tasks may appear in model training data, inflating absolute success rates. Contamination effects are difficult to quantify and could differ across models and benchmarks. To the extent that contamination inflates all models roughly equally, the doubling time is less affected than absolute horizon values.


Appendix A: Estimation Sensitivity Analysis

How sensitive are the headline results — time horizons and the doubling time — to errors in human time estimates?

We test this by applying controlled perturbations to the published June 2025 task times, refitting the full IRT pipeline, and measuring how the doubling time changes. The key question for each type of estimation error: does it affect the doubling time, or just the absolute horizon values?

Figures pending.*Add figures from perturbation sensitivity notebook once verified.

Summary

Estimation failure mode | Effect on doubling time
Constant bias (all estimates too high or too low) | None — shifts all horizons equally
Random noise (per-task estimation error) | None — averages out across tasks and models
Benchmark offset (one benchmark miscalibrated) | Small — no single benchmark dominates the trendline
Difficulty-dependent bias (hard tasks systematically mis-estimated) | This is the one that matters

Perturbation model

We decompose estimation error into four components:

\[\log_2(t') = \log_2(t) + b + c \cdot (\log_2(t) - \bar{x}) + \delta_{\text{benchmark}} + \epsilon\]
Parameter | What it models | Plausible range | Why this range
b (constant bias) | All estimates systematically too high or too low | [−2, 0] doublings (up to 4× underestimate) | Pilot model estimates were 3.5× too low on average
c (difficulty-dependent bias) | Estimation accuracy varies systematically with task difficulty | [−0.35, +0.35] | Explored across multiple functional forms (see below)
δ (benchmark offset) | One benchmark miscalibrated relative to others | [−2, 0] doublings | CVEBench and CyberGym have placeholder estimates
σ (random noise) | Per-task estimation noise, ε ~ N(0, σ²) | [0.5, 2.0] doublings | Empirical: within-expert variance on shared tasks; model estimation residuals

For each configuration, we run 2,000 Monte Carlo iterations: perturb all task times, refit IRT curves for all models, refit the doubling time trendline, and record the result.
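The perturbation itself is a few lines (a sketch under the error model above; the real notebook also refits the IRT curves and trendline):

```python
import numpy as np

def perturb_log2_times(t_minutes, b=0.0, c=0.0, delta=0.0, sigma=0.0, rng=None):
    """Apply the four error components in log2 space; return perturbed times in minutes."""
    rng = rng or np.random.default_rng(0)
    x = np.log2(np.asarray(t_minutes, dtype=float))
    eps = rng.normal(0.0, sigma, size=x.shape)  # random per-task noise
    x_pert = x + b + c * (x - x.mean()) + delta + eps
    return 2.0 ** x_pert
```

Here b = −1 halves every estimate, while c = −0.2 compresses the log-deviations of hard tasks toward the mean without moving the geometric centre.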

Constant bias and random noise

Constant bias does not affect the doubling time. If every estimate is 3× too low, every model’s time horizon shifts by the same factor. The log-space trend slope — and therefore the doubling time — is unchanged. This is analytical, confirmed numerically: shifting all task times by up to 16× produces zero change in DT.

Random noise does not affect the doubling time. At σ = 2 doublings (more noise than expected from k=2 averaged estimates), the doubling time remains unbiased. Noise averages out across hundreds of tasks and multiple models. It widens confidence intervals but does not shift the median.

These are the most likely failure modes. The headline result survives both.

Difficulty-dependent bias

This is the one estimation failure mode that changes the doubling time. If hard tasks are systematically under- or over-estimated relative to easy ones, the log-space trendline slope shifts.

Linear model

The simplest form: estimation error scales linearly with log-difficulty. This gives a clean analytical result:

\[DT' = \frac{DT}{1+c}\]

At $c = -0.2$ (hard tasks underestimated by ~15% in log-space), the doubling time increases by 25%. At $c = +0.2$ (hard tasks overestimated), it decreases by ~17%.
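This follows in one step from the perturbation model with $b = \delta = \epsilon = 0$: linear bias rescales every log-difficulty deviation, and hence every fitted log-horizon, by $(1+c)$, so the trendline slope $m$ (doublings per unit time) scales the same way:

\[\log_2 t' - \bar{x} = (1+c)\left(\log_2 t - \bar{x}\right) \quad\Rightarrow\quad m' = (1+c)\,m \quad\Rightarrow\quad DT' = \frac{1}{m'} = \frac{DT}{1+c}\]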

Beyond linear: four non-linear bias families

The linear model is one functional form in a larger space. Real bias could be non-linear — well-calibrated for familiar tasks and divergent for unfamiliar ones, or concentrated above a difficulty threshold. We explore four alternatives:

  • Piecewise: accurate below a breakpoint (1h, 2h, or 4h), biased above it
  • Power-law: bias that accelerates or decelerates with difficulty
  • Saturating: bias that grows with difficulty but plateaus (tanh-shaped)
  • Heteroscedastic noise: not systematic bias, but variance that increases with difficulty

Each is applied to the June 2025 data across both compression and expansion, and the full pipeline is refit.

Results

Across all bias families at moderate strength (effect ≈ 0.2), the doubling time stays within roughly ±25% of baseline.

  • Saturating bias produces smaller effects than linear bias at the same nominal strength — the plateau limits the perturbation at the extremes.
  • Heteroscedastic noise does not bias the doubling time at all. It widens confidence intervals but the median is unaffected.
  • Piecewise bias above a 2-hour threshold has modest effects (< 2% DT change) because few tasks in the current dataset fall above that threshold.

What the calibration data can detect

A statistical power analysis (OLS slope test on actual vs. estimated completion pairs) shows that $N \geq 25$ pairs spanning 10 doublings provides ≥ 80% power to detect linear bias at $\lvert c \rvert = 0.2$. Power-law distortions are similarly detectable.
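The power figure can be reproduced with a small Monte Carlo. This is a sketch under our assumptions (a z-test on the OLS slope, and ~1 doubling of residual noise, consistent with the k=2 noise level above):

```python
import numpy as np

def slope_test_power(n=25, doublings=10.0, c=0.2, sigma=1.0, iters=2000, seed=1):
    """Monte Carlo power of an OLS slope test for linear difficulty-dependent bias.

    x: estimated log2-times spanning `doublings`; y: actual log2-times, whose
    slope against x is (1 + c) under the linear bias model. Count how often a
    95% z-test rejects slope == 1.
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, doublings, n)
    sxx = ((x - x.mean()) ** 2).sum()
    rejections = 0
    for _ in range(iters):
        y = (1.0 + c) * x + rng.normal(0.0, sigma, n)
        slope, intercept = np.polyfit(x, y, 1)
        resid = y - (slope * x + intercept)
        se = np.sqrt((resid ** 2).sum() / (n - 2) / sxx)  # standard error of the slope
        if abs(slope - 1.0) / se > 1.96:
            rejections += 1
    return rejections / iters
```

Under these assumptions the design point (25 pairs, 10 doublings, |c| = 0.2) lands comfortably above the 80% power threshold.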

The main blind spot is piecewise bias localised above a threshold — the OLS test has low power because the bias only affects the few hardest tasks. This is an acceptable limitation: piecewise bias concentrated at the hard tail affects fewer data points in the IRT fit and has correspondingly smaller effects on the doubling time.


Appendix B: Task Set Requirements at the Frontier

Frontier models are reaching increasingly long time horizons. The most recent models on METR’s software engineering time horizon plot achieve 50% success on tasks of roughly 7 hours in human time [1], and this is increasing rapidly. If offensive cybersecurity horizons follow a similar trajectory, our dataset needs enough long-horizon tasks to produce well-constrained IRT fits in that range.

The issue is that the IRT logistic curve is most constrained by tasks near a model’s 50% success boundary — tasks where the model has roughly coin-flip odds. If the dataset is dominated by short tasks that frontier models solve easily, the fitted curve is poorly anchored at the long-horizon end where it matters most. We need to quantify: how many hard tasks, at what difficulty levels, produce acceptable precision?

Simulation setup

The simulation uses a known logistic model with two parameters: a true 50% horizon of 8 hours, and a discrimination parameter of −0.56 (the median across METR’s frontier software engineering models, fit with C=10 regularization, excluding human baselines). The discrimination parameter controls how steeply success probability falls off with increasing task difficulty — a more negative value means a sharper dropoff.

The dataset starts with a base of 462 easy tasks (1 min – 2h) representing the existing short-horizon benchmarks (CyBashBench 200 + NL2Bash 162 + InterCode-CTF 100). We then add varying numbers of hard tasks from specific difficulty buckets and ask: how precisely can the IRT pipeline recover the true 8-hour horizon?

Each configuration runs 2,000 Monte Carlo iterations. In each iteration, we generate stochastic success/failure outcomes from the known model, refit the IRT curve, and record the recovered horizon. The spread of recovered values across iterations tells us how well-constrained the fit is.
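A stripped-down version of one iteration (fixed discrimination and grid-search MLE in log2-hours; the production pipeline fits both parameters with regularization):

```python
import numpy as np

A = 0.56  # |discrimination|; the document's -0.56 under its sign convention

def p_success(log2_len, log2_horizon):
    """Success probability falls logistically as task length exceeds the 50% horizon."""
    return 1.0 / (1.0 + np.exp(A * (log2_len - log2_horizon)))

def recover_horizon(task_log2_hours, true_log2_horizon, rng,
                    grid=np.linspace(-2.0, 8.0, 401)):
    """Simulate pass/fail outcomes from the known model, then refit the horizon by MLE."""
    p = p_success(task_log2_hours, true_log2_horizon)
    y = rng.random(task_log2_hours.size) < p
    loglik = [
        np.sum(np.log(np.where(y, p_success(task_log2_hours, h),
                               1.0 - p_success(task_log2_hours, h)) + 1e-12))
        for h in grid
    ]
    return grid[int(np.argmax(loglik))]
```

Repeating this across many iterations for a given task mix and reading off the spread of recovered horizons yields relIQR figures like those in Tables B1 and B2.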

How many hard tasks per difficulty bucket?

Table B1 shows the precision of the recovered time horizon (measured as relative IQR — lower is better) when adding N tasks from a single difficulty bucket to the 462-task easy base.

Table B1. Time horizon recovery precision when adding N tasks from a single difficulty bucket. relIQR measures the interquartile range of recovered horizons as a fraction of the median — lower means tighter estimates. True 50% horizon = 8h, discrimination = −0.56.
Difficulty bucket | N = 5 | N = 10 | N = 20 | N = 40 | N = 80
2–4h | 96% | 79% | 79% | 67% | 51%
4–8h | 91% | 74% | 67% | 51% | 40%
8–16h | 86% | 67% | 58% | 42% | 32%
16–32h | 80% | 64% | 51% | 40% | 29%
32–64h | 75% | 65% | 53% | 41% | 31%

Tasks from buckets above the true 50% horizon (8h) are more informative per task. This follows from the logistic model: Fisher information peaks near the 50% success point.

What does this mean for our dataset?

In practice, our dataset won’t have tasks from just one difficulty bucket — it will have a mix across the full range. Table B2 shows how different realistic task allocations perform, from no hard tasks at all (relying entirely on the short-horizon base) to a CyberGym-scale addition of hundreds of long-horizon tasks.

Table B2. Time horizon recovery precision for different dataset compositions. The "hard tasks" column counts tasks above 4h. True 50% horizon = 8h, discrimination = −0.56.
Dataset composition | Hard tasks | Median recovered | IQR | relIQR
Short-horizon benchmarks only | 0 | 7.7h | 5.4h – 13.1h | 100%
+ 20 tasks in 4–8h only | 20 | 7.9h | 6.0h – 11.3h | 67%
+ 40 tasks in 4–8h only | 40 | 7.9h | 6.2h – 10.3h | 51%
+ 10 each in 4–8h and 8–16h | 20 | 8.0h | 6.1h – 11.0h | 61%
+ 20 each in 4–8h and 8–16h | 40 | 7.8h | 6.3h – 10.1h | 48%
+ 40 each in 4–8h and 8–16h | 80 | 8.0h | 6.8h – 9.5h | 34%
+ 20 each across 4 buckets (2–32h) | 80 | 8.0h | 6.7h – 9.6h | 36%
+ 20 each across 5 buckets (2–64h) | 100 | 7.9h | 6.8h – 9.2h | 30%

Without any hard tasks, the recovered horizon is essentially unconstrained (100% relIQR — the IQR spans from 5 to 13 hours when the true value is 8). Adding 40 hard tasks spread across two buckets brings this to 48%, and broader distributions across more buckets bring it below 30%.

The minimum viable allocation is 40 hard tasks: 20 each in the 4–8h and 8–16h buckets (48% relIQR). The steeper discrimination parameter (−0.56 vs the previous −0.475) means each hard task carries slightly more information, giving a bit more margin than previous estimates. CyberGym’s ~66% frontier success rate suggests its tasks span roughly 0.5h–16h in difficulty, placing an estimated ~57 tasks above 8h — well above the minimum. The actual difficulty distribution is unknown until expert estimates arrive.

References

  • [1] T. Kwa et al., “Measuring AI Ability to Complete Long Tasks.” 2025, [Online]. Available at: https://arxiv.org/abs/2503.14499.
  • [2] Y. Zhu et al., “CVE-Bench: A Benchmark for AI Agents’ Ability to Exploit Real-World Web Application Vulnerabilities,” 2025, [Online]. Available at: https://arxiv.org/abs/2503.17332.
  • [3] Z. Wang, T. Shi, J. He, M. Cai, J. Zhang, and D. Song, “CyberGym: Evaluating AI Agents’ Cybersecurity Capabilities at Scale.” 2025, [Online]. Available at: https://arxiv.org/abs/2506.02548.
  • [4] S. Peters, “AI Task Length Horizons in Offensive Cybersecurity.” Jul. 2025, [Online]. Available at: https://sean-peters-au.github.io/2025/07/02/ai-task-length-horizons-in-offensive-cybersecurity.html.
  • [5] UK AI Safety Institute, “Frontier AI Trends Report,” UK AI Safety Institute, Dec. 2025. [Online]. Available at: https://www.aisi.gov.uk/frontier-ai-trends-report.
  • [6] Anthropic, “Claude Opus 4.6 System Card,” Anthropic, 2026.
  • [7] OpenAI, “GPT-5.3 Codex System Card,” OpenAI, 2026. [Online]. Available at: https://openai.com/index/gpt-5-3-codex-system-card/.
  • [8] Anthropic, “Disrupting the First Reported AI-Orchestrated Cyber Espionage Campaign.” Nov. 2025, [Online]. Available at: https://www.anthropic.com/news/disrupting-AI-espionage.
  • [9] Anthropic, “0-Days.” Feb. 2026, [Online]. Available at: https://red.anthropic.com/2026/zero-days/.
  • [10] G. Rasch, Probabilistic Models for Some Intelligence and Attainment Tests. Chicago: University of Chicago Press, 1980.
  • [11] X. V. Lin, C. Wang, L. Zettlemoyer, and M. D. Ernst, “NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System,” Miyazaki, Japan, May 2018, [Online]. Available at: https://aclanthology.org/L18-1491/.
  • [12] J. Yang, A. Prabhakar, K. Narasimhan, and S. Yao, “InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback,” 2023, [Online]. Available at: https://arxiv.org/abs/2306.14898.
  • [13] M. Shao et al., “NYU CTF Dataset: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security,” 2024, [Online]. Available at: https://arxiv.org/abs/2406.05590.
  • [14] A. K. Zhang et al., “Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models,” 2025, [Online]. Available at: https://arxiv.org/abs/2408.08926.