Evidence Ledger

Honest numbers.
Including what failed.

Most AI governance vendors publish only the metrics that make them look good. This page lists everything we measured — the wins, the caveats, and the things we tried that didn't work. If you're about to spend $5K to $150K with me, you deserve to see the receipts before you sign.

99.42% Combined AUC (6 attack types × 50 trials)

91/91 Red team prompts blocked, 0 false positives

150/150 Compliance tests across 13 frameworks

0.33ms Per-agent latency, full 14-layer pipeline

1. Three-mechanism combined defense Proven

The headline number. Six attack types, 50 trials each (300 total), running against the real scbe_14layer_reference.py pipeline. Combined mean AUC: 0.9942.

Attack type	Phase AUC	Tonic AUC	Drift AUC	Combined
A. Wrong tongue	0.642	1.000	0.468	0.9992
B. Replay attack	0.525	0.997	0.569	0.9924
C. Synthetic bypass	0.567	0.993	1.000	0.9998
D. Wrong frequency	0.465	1.000	0.508	0.9992
E. Scale anomaly	0.992	0.926	1.000	0.9902
F. Adaptive / rounded	0.497	0.502	1.000	0.9842
Mean	0.6146	0.9029	0.7575	0.9942

Each detector alone is incomplete — phase catches E but fails A/B/D/F, tonic catches A/B/D but fails F, drift catches C/E/F but fails A/D. The combined defense is the product.

experiments/three_mechanism_results.json

2. Red team benchmark vs public comparators Proven

91 adversarial prompts across 10 attack categories. Published as a public HuggingFace dataset so anyone can reproduce.

System	Blocked	False Positives	Rate
SCBE-AETHERMOORE	91/91	0	100%
ProtectAI DeBERTa v2	62/91	—	68%
Keyword filter	27/91	—	30%
Meta Prompt Guard	15/91	—	16%
No protection	0/91	—	0%

These are public comparators evaluated on the same 91 prompts. The dataset is downloadable and the test harness is published.

huggingface.co/datasets/issdandavis/scbe-red-team-benchmarks

3. Regulatory compliance tests Proven

150 automated compliance tests covering 13 frameworks, 14 pipeline layers, and 12 security axioms. Raw output:

Frameworks covered (150/150 passing)

HIPAA/HITECH · NIST 800-53 · FIPS 140-3 · PCI-DSS v4.0 · SOX Section 302/404 · GDPR · ISO 27001:2022 · IEC 62443 · SOC 2 Type II · FedRAMP · CMMC 2.0 · FDA 21 CFR Part 11 · NERC CIP

Axioms validated

A1 Boundedness · A2 Continuity · A3 Encryption (AES-256-GCM) · A4 Nonce Uniqueness · A5 Pseudonymization · A6 Least Privilege · A7 Fail-to-Noise · A8 Key Lifecycle · A9 Context Binding · A10 Audit Completeness · A11 Monotonic Recovery · A12 Bounded Failure

tests/scbe_compliance_report.md compliance_report.json (raw)

4. Latency Caveat

Per-agent timing for the full 14-layer pipeline: 0.328 ms/agent (σ 0.222), measured over 100 runs, 30 agents, 50 steps. That's sub-millisecond — much faster than our old site claimed.

What this does NOT measure: end-to-end request-response wall clock through a real API gateway. That benchmark is pending. Don't take the 0.33 ms/agent number as "your chatbot will add 0.33 ms per message" — network, serialization, and policy load-time add overhead.

Realistic estimate for a production deployment: <100 ms added latency per message, which is what the solution pages quote.

experiments/pipeline_vs_baseline_results.json

5. Pipeline-depth ablation Caveat — important

We measured AUC across 4 pipeline depths (2-layer, 3-layer, 6-layer, 14-layer) on 3 attack classes (random, subtle, adaptive). The results are surprising and worth knowing:

Attack	2-layer	6-layer	14-layer
Random	0.9987	0.7945	0.9780
Subtle	0.9837	0.9247	0.0536
Adaptive	0.8864	0.8220	0.6905

The raw 14-layer pipeline scores 0.054 AUC on subtle attacks — worse than random guessing. This is why you don't buy "the 14-layer pipeline" — you buy the three-mechanism combined defense (phase + tonic + drift) that sits on top and hits 99.42%.

We could have hidden this result. We're publishing it because the combined defense is the real product and the pipeline is a component, not the answer. Anyone selling you "14-layer detection" without the combined scoring layer is selling you the worst possible configuration.

experiments/pipeline_vs_baseline_results.json

6. What we disproved Disproven

Not every hypothesis survived contact with data. These are ideas we tried, measured, and rejected:

X1. Hyperbolic distance is superior to Euclidean for anomaly detection

100 runs, 20 agents, 5 malicious. Euclidean AUC 0.9995, hyperbolic 0.9553. Hyperbolic lost. We still use Poincaré embeddings for policy-region geometry, but the raw distance metric is Euclidean because that's what measures best.

X2. GeoSeal swarm coordination

Measured 0.543 AUC on swarm detection. Below useful threshold. Retired.

X3. Constant-time cryptographic operations in pure Python/numpy

Impossible at the Python/numpy layer. Constant-time work happens in the underlying C crypto libraries (liboqs, PyCryptodome), not in our code. We stopped claiming "constant time" in our layer and deferred to the crypto library guarantees.

X4. Tripoint centroid hyperbolic advantage

Tried using three-point hyperbolic centroids as a trust anchor. Measured no advantage over the Euclidean baseline. Removed from the pipeline.

docs/CLAIMS_EVIDENCE_LEDGER.md — full ledger

7. What's still untested Honest status

The evidence ledger in the repo lists patent claims 5, 6, 7, 8, 12, 13, 17, 18, 20, and 21 as CODE_EXISTS_UNTESTED. The code is written, the math is described, but I haven't run controlled experiments yet. We don't advertise these as proven.

The full ledger: docs/CLAIMS_AUDIT_V4.md

Run it yourself

Everything on this page is reproducible from the open source repo. Clone it, run the experiments, compare against your own data.

To reproduce the headline numbers:

git clone https://github.com/issdandavis/SCBE-AETHERMOORE.git

cd SCBE-AETHERMOORE && pip install -e ".[test]"

python experiments/three_mechanism_combined.py (reproduces the 99.42%)

python experiments/pipeline_vs_baseline.py (reproduces the ablation table)

python -m pytest tests/ -v (runs the Python test suite)

Questions I get asked

Why publish the disproven claims?

Because the honest ledger is itself the credibility signal. Any vendor willing to say "this idea didn't work" in public is more trustworthy than one who only publishes wins. If you're buying audit evidence from me, you need to trust that the evidence I deliver is accurate — which means you need to see that I kill my own darlings when the data says so.

Why is the raw 14-layer pipeline so bad on subtle attacks?

Because layer depth is a blunt instrument. More layers means more parameters to fit the training distribution, which makes subtle out-of-distribution attacks easier to miss. The three-mechanism combined defense (phase + tonic + drift) corrects for this by using complementary detection strategies — each covers attack types the others miss.

Can I trust any of the claims on the solution pages?

Yes. The numbers on CX Guardrail, ISO 42001, and Red Team are grounded in the files linked on this page. The only softer claims are latency ("under 100 ms per message in production") which comes from a realistic estimate, not a measured benchmark, because we don't have the end-to-end gateway benchmark yet. I'd rather say "about this" than fake precision.

Where are the 29,000 tests?

The earlier claim of "29,000+ tests passing" was a conflation of total TypeScript test assertions across the ecosystem and the Python test count. The verified numbers: 638 Python tests + 150 compliance tests + 91 red-team prompts = real and reproducible. The TypeScript-side count needs its own audit before we publish it.

Honest numbers.Including what failed.