Honest numbers.
Including what failed.
Most AI governance vendors publish only the metrics that make them look good. This page lists everything we measured — the wins, the caveats, and the things we tried that didn't work. If you're about to spend $5K to $150K with me, you deserve to see the receipts before you sign.
1. Three-mechanism combined defense Proven
The headline number. Six attack types, 50 trials each (300 total), running against the real scbe_14layer_reference.py pipeline. Combined mean AUC: 0.9942.
| Attack type | Phase AUC | Tonic AUC | Drift AUC | Combined |
|---|---|---|---|---|
| A. Wrong tongue | 0.642 | 1.000 | 0.468 | 0.9992 |
| B. Replay attack | 0.525 | 0.997 | 0.569 | 0.9924 |
| C. Synthetic bypass | 0.567 | 0.993 | 1.000 | 0.9998 |
| D. Wrong frequency | 0.465 | 1.000 | 0.508 | 0.9992 |
| E. Scale anomaly | 0.992 | 0.926 | 1.000 | 0.9902 |
| F. Adaptive / rounded | 0.497 | 0.502 | 1.000 | 0.9842 |
| Mean | 0.6146 | 0.9029 | 0.7575 | 0.9942 |
Each detector alone is incomplete — phase catches E but fails A/B/D/F, tonic catches A/B/D but fails F, drift catches C/E/F but fails A/D. The combined defense is the product.
experiments/three_mechanism_results.json2. Red team benchmark vs public comparators Proven
91 adversarial prompts across 10 attack categories. Published as a public HuggingFace dataset so anyone can reproduce.
| System | Blocked | False Positives | Rate |
|---|---|---|---|
| SCBE-AETHERMOORE | 91/91 | 0 | 100% |
| ProtectAI DeBERTa v2 | 62/91 | — | 68% |
| Keyword filter | 27/91 | — | 30% |
| Meta Prompt Guard | 15/91 | — | 16% |
| No protection | 0/91 | — | 0% |
These are public comparators evaluated on the same 91 prompts. The dataset is downloadable and the test harness is published.
huggingface.co/datasets/issdandavis/scbe-red-team-benchmarks3. Regulatory compliance tests Proven
150 automated compliance tests covering 13 frameworks, 14 pipeline layers, and 12 security axioms. Raw output:
Frameworks covered (150/150 passing)
HIPAA/HITECH · NIST 800-53 · FIPS 140-3 · PCI-DSS v4.0 · SOX Section 302/404 · GDPR · ISO 27001:2022 · IEC 62443 · SOC 2 Type II · FedRAMP · CMMC 2.0 · FDA 21 CFR Part 11 · NERC CIP
Axioms validated
A1 Boundedness · A2 Continuity · A3 Encryption (AES-256-GCM) · A4 Nonce Uniqueness · A5 Pseudonymization · A6 Least Privilege · A7 Fail-to-Noise · A8 Key Lifecycle · A9 Context Binding · A10 Audit Completeness · A11 Monotonic Recovery · A12 Bounded Failure
tests/scbe_compliance_report.md compliance_report.json (raw)4. Latency Caveat
Per-agent timing for the full 14-layer pipeline: 0.328 ms/agent (σ 0.222), measured over 100 runs, 30 agents, 50 steps. That's sub-millisecond — much faster than our old site claimed.
What this does NOT measure: end-to-end request-response wall clock through a real API gateway. That benchmark is pending. Don't take the 0.33 ms/agent number as "your chatbot will add 0.33 ms per message" — network, serialization, and policy load-time add overhead.
Realistic estimate for a production deployment: <100 ms added latency per message, which is what the solution pages quote.
experiments/pipeline_vs_baseline_results.json5. Pipeline-depth ablation Caveat — important
We measured AUC across 4 pipeline depths (2-layer, 3-layer, 6-layer, 14-layer) on 3 attack classes (random, subtle, adaptive). The results are surprising and worth knowing:
| Attack | 2-layer | 6-layer | 14-layer |
|---|---|---|---|
| Random | 0.9987 | 0.7945 | 0.9780 |
| Subtle | 0.9837 | 0.9247 | 0.0536 |
| Adaptive | 0.8864 | 0.8220 | 0.6905 |
The raw 14-layer pipeline scores 0.054 AUC on subtle attacks — worse than random guessing. This is why you don't buy "the 14-layer pipeline" — you buy the three-mechanism combined defense (phase + tonic + drift) that sits on top and hits 99.42%.
We could have hidden this result. We're publishing it because the combined defense is the real product and the pipeline is a component, not the answer. Anyone selling you "14-layer detection" without the combined scoring layer is selling you the worst possible configuration.
experiments/pipeline_vs_baseline_results.json6. What we disproved Disproven
Not every hypothesis survived contact with data. These are ideas we tried, measured, and rejected:
X1. Hyperbolic distance is superior to Euclidean for anomaly detection
100 runs, 20 agents, 5 malicious. Euclidean AUC 0.9995, hyperbolic 0.9553. Hyperbolic lost. We still use Poincaré embeddings for policy-region geometry, but the raw distance metric is Euclidean because that's what measures best.
X2. GeoSeal swarm coordination
Measured 0.543 AUC on swarm detection. Below useful threshold. Retired.
X3. Constant-time cryptographic operations in pure Python/numpy
Impossible at the Python/numpy layer. Constant-time work happens in the underlying C crypto libraries (liboqs, PyCryptodome), not in our code. We stopped claiming "constant time" in our layer and deferred to the crypto library guarantees.
X4. Tripoint centroid hyperbolic advantage
Tried using three-point hyperbolic centroids as a trust anchor. Measured no advantage over the Euclidean baseline. Removed from the pipeline.
docs/CLAIMS_EVIDENCE_LEDGER.md — full ledger7. What's still untested Honest status
The evidence ledger in the repo lists patent claims 5, 6, 7, 8, 12, 13, 17, 18, 20, and 21 as CODE_EXISTS_UNTESTED. The code is written, the math is described, but I haven't run controlled experiments yet. We don't advertise these as proven.
The full ledger: docs/CLAIMS_AUDIT_V4.md
Run it yourself
Everything on this page is reproducible from the open source repo. Clone it, run the experiments, compare against your own data.
To reproduce the headline numbers:
git clone https://github.com/issdandavis/SCBE-AETHERMOORE.git
cd SCBE-AETHERMOORE && pip install -e ".[test]"
python experiments/three_mechanism_combined.py (reproduces the 99.42%)
python experiments/pipeline_vs_baseline.py (reproduces the ablation table)
python -m pytest tests/ -v (runs the Python test suite)
Questions I get asked
Why publish the disproven claims?
Because the honest ledger is itself the credibility signal. Any vendor willing to say "this idea didn't work" in public is more trustworthy than one who only publishes wins. If you're buying audit evidence from me, you need to trust that the evidence I deliver is accurate — which means you need to see that I kill my own darlings when the data says so.
Why is the raw 14-layer pipeline so bad on subtle attacks?
Because layer depth is a blunt instrument. More layers means more parameters to fit the training distribution, which makes subtle out-of-distribution attacks easier to miss. The three-mechanism combined defense (phase + tonic + drift) corrects for this by using complementary detection strategies — each covers attack types the others miss.
Can I trust any of the claims on the solution pages?
Yes. The numbers on CX Guardrail, ISO 42001, and Red Team are grounded in the files linked on this page. The only softer claims are latency ("under 100 ms per message in production") which comes from a realistic estimate, not a measured benchmark, because we don't have the end-to-end gateway benchmark yet. I'd rather say "about this" than fake precision.
Where are the 29,000 tests?
The earlier claim of "29,000+ tests passing" was a conflation of total TypeScript test assertions across the ecosystem and the Python test count. The verified numbers: 638 Python tests + 150 compliance tests + 91 red-team prompts = real and reproducible. The TypeScript-side count needs its own audit before we publish it.