SCBE Benchmark Evidence Dashboard

Evidence lanes: 6/10

Readiness: 60%

Proof rule: Every public claim must cite its command, artifact path, commit hash, and claim boundary.

| Lane | Status | Command / Artifact | Hash | Summary / Boundary |
|---|---|---|---|---|
| `hard-agentic`: hard agentic pretest matrix (12/14 readiness lanes) | evidence-ready | `scbe bench hard-agentic --json` / `artifacts/benchmarks/hard_agentic_pretest/latest_report.json` | 7e0de2ea | `{"blocked_or_failed":5,"executed":8,"ready_or_pass":9,"target_count":14}` / local readiness/pretest matrix; not a public benchmark leaderboard score |
| `research`: BrowseComp/GAIA-style local research fixtures | evidence-ready | `scbe bench research --json` / `artifacts/benchmarks/research_agent_fixtures/latest_report.json` | 5564f227 | `{"baseline_pass_rate":0,"baseline_passes":0,"decision":"PASS","scbe_pass_rate":1,"scbe_passes":2,"task_count":2,"unresolved_tasks":[]}` / local BrowseComp/GAIA-style fixtures; not public BrowseComp or GAIA scores |
| `rubix-browser`: permission-hypercube browser-control geometry fixture | evidence-ready | `scbe bench rubix-browser --json` / `artifacts/benchmarks/rubix_browser_hypercube/latest_report.json` | b2f4096e | `{"baseline_avg":0.4167,"baseline_completed":0,"baseline_illegal_moves":3,"decision":"PASS","hypercube_avg":1,"hypercube_completed":3,"hypercube_illegal_moves":0,"task_count":3}` / local browser-control geometry fixture; not a WebArena, BrowserGym, OSWorld, or VisualWebArena score |
| `arc-agi2`: ARC-AGI-2 local baseline (rule-free strategies, lower bound) | missing-artifact | `scbe bench arc-agi2 --json` / `artifacts/benchmarks/arc_agi2_local/latest_report.json` | missing | No latest artifact yet / rule-free lower-bound baselines on public ARC-AGI-2 data; not a competitive ARC-AGI-2 submission score |
| `arc-style-grid`: ARC-style grid reasoning fixture (SCBE sensor outputs) | missing-artifact | `scbe bench arc-style-grid --json` / `artifacts/benchmarks/arc_style_grid/latest_report.json` | missing | No latest artifact yet / local ARC-style grid fixture using SCBE sensor outputs; not a public ARC score |
| `swe-local`: SWE-style local real-patch repair fixtures | missing-artifact | `scbe bench swe-local --json` / `artifacts/benchmarks/swe_local/latest_report.json` | missing | No latest artifact yet / local real-patch fixtures; not a SWE-bench Verified or SWEbench.com leaderboard score |
| `cli-competitive`: CLI command accuracy vs Codex/Claude-Code-style baselines | evidence-ready | `scbe bench cli-competitive --json` / `artifacts/benchmarks/cli_competitive/cli_competitive_benchmark_latest.json` | 8710eb77 | `{}` / local CLI command accuracy fixture; not a published competitive benchmark score |
| `compound-decompose`: RDKit compound decomposition/recomposition through atom mud | evidence-ready | `scbe bench compound-decompose --json` / `artifacts/benchmarks/compound_decomposition_recomposition/latest_report.json` | c1c9f514 | `{"decision":"PASS","rdkit_available":true,"case_count":30,"passed":30,"pass_rate":1,"mud_step":5,"rdkit_error":null}` / computational compound decomposition/recomposition benchmark; not wet-lab synthesis, biological efficacy proof, dosing guidance, or medical advice |
| `providers`: AI provider health matrix (local > free > paid, free-first policy) | missing-artifact | `scbe bench providers --json` / `artifacts/benchmarks/provider_health/latest_report.json` | missing | No latest artifact yet / local provider reachability check; not an API reliability guarantee |
| `longform`: Longform Bridge durable CLI workflow with squad dispatch receipts | evidence-ready | `scbe bench longform --json` / `artifacts/benchmarks/longform_cli_benchmark_latest.json` | 40eea47c | `{}` / local durable-workflow CLI fixture; not a guarantee of autonomous task completion |
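
The proof rule and the readiness figure can be checked mechanically. The following is a minimal sketch, not the dashboard's actual generator: the `Lane` record, its field names, and the helpers `missing_citations` and `readiness` are all hypothetical, and it assumes a lane counts toward readiness only when its status is `evidence-ready` and none of the four proof-rule fields is empty or `missing`. The two sample lanes are copied from the table above.

```python
from dataclasses import dataclass

# Hypothetical record shape for illustration only; these field names are
# assumptions, not the schema used by the real dashboard.
@dataclass
class Lane:
    lane_id: str
    status: str         # "evidence-ready" or "missing-artifact"
    command: str
    artifact_path: str
    commit_hash: str    # "missing" when no artifact exists yet
    boundary: str

# The four items the proof rule requires every public claim to cite.
REQUIRED = ("command", "artifact_path", "commit_hash", "boundary")

def missing_citations(lane):
    """Return the proof-rule fields that are empty or marked 'missing'."""
    return [f for f in REQUIRED if getattr(lane, f).strip() in ("", "missing")]

def readiness(lanes):
    """Return (ready lanes, total lanes, readiness percent)."""
    ready = sum(
        1 for lane in lanes
        if lane.status == "evidence-ready" and not missing_citations(lane)
    )
    total = len(lanes)
    percent = 100.0 * ready / total if total else 0.0
    return ready, total, percent

# Two rows taken from the table above.
research = Lane(
    "research", "evidence-ready", "scbe bench research --json",
    "artifacts/benchmarks/research_agent_fixtures/latest_report.json",
    "5564f227",
    "local BrowseComp/GAIA-style fixtures; not public BrowseComp or GAIA scores",
)
swe_local = Lane(
    "swe-local", "missing-artifact", "scbe bench swe-local --json",
    "artifacts/benchmarks/swe_local/latest_report.json",
    "missing",
    "local real-patch fixtures; not SWE-bench Verified or SWEbench.com leaderboard score",
)

print(missing_citations(swe_local))
print(readiness([research, swe_local]))
```

Applied to all ten lanes in the table, the same tally would count the six `evidence-ready` rows against ten total, which matches the 6/10 and 60% shown at the top of the dashboard.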