SCBE-AETHERMOORE
Auto-updated feed

Latest AI safety research

Recent papers and discussions on AI governance, LLM security, prompt injection, red teaming, and alignment — refreshed daily from arXiv and HackerNews.

Last updated: 2026-04-12T20:55:03.896106+00:00
HackerNews 2026-04-12 AI safety

Core views on AI safety (March 2023)

2 points, 0 comments

Olshansky
HackerNews 2026-04-12 hallucination

Claude Opus 4.6 accuracy on BridgeBench hallucination test drops from 83% to 68%

6 points, 1 comments

bratao
HackerNews 2026-04-12 red team

Show HN: Revdiff – TUI diff reviewer with inline annotations for AI agents

8 points, 2 comments

bumpa
HackerNews 2026-04-12 red team

Show HN: Specsight – Living product specs generated from your codebase

1 points, 0 comments

aiola
HackerNews 2026-04-12 red team

Show HN: Claudraband – Claude Code for the Power User

56 points, 12 comments

halfwhey
HackerNews 2026-04-12 LLM security

I built an LLM Wiki and RAG solution: here's a demo for a security KB

1 points, 1 comments

nickk81
HackerNews 2026-04-12 AI governance

Show HN: I visualized how AI agent systems accidentally become org charts

3 points, 0 comments

bhaviav100
HackerNews 2026-04-12 hallucination

Strong feeling: we are in a folded AI reality

1 points, 1 comments

Jet_Xu
HackerNews 2026-04-12 LLM security

Show HN: Two Claudes collaborating through shared memory on a $100 mini-PC

2 points, 0 comments

asixicle
HackerNews 2026-04-12 alignment

Mythos Just Proved the Alignment Field Is Building the Wrong Thing

4 points, 2 comments

ajspizz
HackerNews 2026-04-11 alignment

Show HN: Bal – a Knights and Knaves logic puzzle game with Glicko rating system

1 points, 3 comments

skaye
HackerNews 2026-04-11 hallucination

Show HN: Cyber Pulse. AI pipeline for triage and alerting on cyber news/intel

1 points, 0 comments

kozi93
HackerNews 2026-04-10 AI safety

Nono – Runtime safety infrastructure for AI agents

3 points, 0 comments

jossclimb
HackerNews 2026-04-10 AI compliance

EU AI Act compliance layer for Claude Managed Agents (MCP, open source)

3 points, 0 comments

camilo_ayerbe
HackerNews 2026-04-10 AI safety

PyTorch Foundation Expands AI Stack with Safetensors, ExecuTorch, and Helion

2 points, 0 comments

Brajeshwar
HackerNews 2026-04-10 AI compliance

Ask HN: Is a purely Markdown-based CRM a terrible idea? Optimized for LLM agents

1 points, 1 comments

dmonterocrespo
HackerNews 2026-04-10 adversarial

Show HN: Zeroclawed: Secure Agent Gateway

8 points, 3 comments

bglusman
HackerNews 2026-04-10 LLM security

Show HN: LunarGate – a self-hosted OpenAI-compatible LLM gateway

2 points, 0 comments

jmartenka
HackerNews 2026-04-10 adversarial

Show HN: Keeper – embedded secret store for Go (help me break it)

63 points, 33 comments

babawere
HackerNews 2026-04-10 prompt injection

IPI-Scanner – Detect Indirect Prompt Injection Attacks

2 points, 0 comments

xamitgupta
arXiv 2026-04-09 alignment

ETCH-X: Robustify Expressive Body Fitting to Clothed Humans with Composable Datasets

Human body fitting, which aligns parametric body models such as SMPL to raw 3D point clouds of clothed humans, serves as a crucial first step for downstream tasks like animation and texturing. An effective fitting method should be both locally expressive-capturing fine details su

Xiaoben Li, Jingyi Wu, Zeyu Cai
arXiv 2026-04-09 alignment

When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA , a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies

Zhengyang Sun, Yu Chen, Xin Zhou
arXiv 2026-04-09 alignment

SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds

Robotic manipulation with deformable objects represents a data-intensive regime in embodied learning, where shape, contact, and topology co-evolve in ways that far exceed the variability of rigids. Although simulation promises relief from the cost of real-world data acquisition,

Yunsong Zhou, Hangxu Liu, Xuekun Jiang
arXiv 2026-04-09 hallucination

Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how tr

Jiayuan Ye, Vitaly Feldman, Kunal Talwar
arXiv 2026-04-09 adversarial

Differentially Private Language Generation and Identification in the Limit

We initiate the study of language generation in the limit, a model recently introduced by Kleinberg and Mullainathan [KM24], under the constraint of differential privacy. We consider the continual release model, where a generator must eventually output a stream of valid strings w

Anay Mehrotra, Grigoris Velegkas, Xifan Yu
arXiv 2026-04-09 prompt injection

PIArena: A Platform for Prompt Injection Evaluation

Prompt injection attacks pose serious security risks across a wide range of real-world applications. While receiving increasing attention, the community faces a critical gap: the lack of a unified platform for prompt injection evaluation. This makes it challenging to reliably com

Runpeng Geng, Chenlong Yin, Yanting Wang
HackerNews 2026-04-09 AI compliance

Show HN: AgentMint – Open-source OWASP compliance for AI agent tool calls

5 points, 0 comments

keertahacker
HackerNews 2026-04-09 prompt injection

Show HN: BrokenClaw Part 5: GPT-5.4 Edition (Prompt Injection)

9 points, 2 comments

veganmosfet
arXiv 2026-04-09 adversarial

Adversarial Label Invariant Graph Data Augmentations for Out-of-Distribution Generalization

Out-of-distribution (OoD) generalization occurs when representation learning encounters a distribution shift. This occurs frequently in practice when training and testing data come from different environments. Covariate shift is a type of distribution shift that occurs only in th

Simon Zhang, Ryan P. DeMilt, Kun Jin
arXiv 2026-04-09 adversarial

Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing

In large language model (LLM) agents, reasoning trajectories are treated as reliable internal beliefs for guiding actions and updating memory. However, coherent reasoning can still violate logical or evidential constraints, allowing unsupported beliefs repeatedly stored and propa

Wenhao Yuan, Chenchen Lin, Jian Chen
HackerNews 2026-04-09 alignment

AI alignment: the signal is the goal

2 points, 0 comments

atzeus
arXiv 2026-04-09 prompt injection

Security Concerns in Generative AI Coding Assistants: Insights from Online Discussions on GitHub Copilot

Generative Artificial Intelligence (GenAI) has become a central component of many development tools (e.g., GitHub Copilot) that support software practitioners across multiple programming tasks, including code completion, documentation, and bug detection. However, current research

Nicolás E. Díaz Ferreyra, Monika Swetha Gurupathi, Zadia Codabux
arXiv 2026-04-09 hallucination

MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought

Large Language Models (LLMs) still suffer from severe hallucinations and catastrophic forgetting during causal reasoning over massive, fragmented long contexts. Existing memory mechanisms typically treat retrieval as a static, single-step passive matching process, leading to seve

Haodong Lei, Junming Liu, Yirong Chen
arXiv 2026-04-09 hallucination

MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning

Medical Vision-Language Models (VLMs) hold immense promise for complex clinical tasks, but their reasoning capabilities are often constrained by text-only paradigms that fail to ground inferences in visual evidence. This limitation not only curtails performance on tasks requiring

Zheng Jiang, Heng Guo, Chengyu Fang
arXiv 2026-04-09 AI safety

LINE: LLM-based Iterative Neuron Explanations for Vision Models

Interpreting the concepts encoded by individual neurons in deep neural networks is a crucial step towards understanding their complex decision-making processes and ensuring AI safety. Despite recent progress in neuron labeling, existing methods often limit the search space to pre

Vladimir Zaigrajew, Michał Piechota, Gaspar Sekula
arXiv 2026-04-09 jailbreak

Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation

While Large Language Models (LLMs) have achieved remarkable performance, they remain vulnerable to jailbreak attacks that circumvent safety constraints. Existing strategies, ranging from heuristic prompt engineering to computationally intensive optimization, often face significan

Wenpeng Xing, Moran Fang, Guangtai Wang
arXiv 2026-04-09 prompt injection

Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection

Existing red-teaming studies on GUI agents have important limitations. Adversarial perturbations typically require white-box access, which is unavailable for commercial systems, while prompt injection is increasingly mitigated by stronger safety alignment. To study robustness und

Wenkui Yang, Chao Jin, Haisu Zhu
arXiv 2026-04-09 AI governance

The Accountability Horizon: An Impossibility Theorem for Governing Human-Agent Collectives

Existing accountability frameworks for AI systems, legal, ethical, and regulatory, rest on a shared assumption: for any consequential outcome, at least one identifiable person had enough involvement and foresight to bear meaningful responsibility. This paper proves that agentic A

Haileleol Tibebu
HackerNews 2026-04-09 adversarial

Show HN: We Evaluates Medical Research Agent Skills

2 points, 0 comments

The_resa
arXiv 2026-04-09 jailbreak

TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense

Existing jailbreak defense paradigms primarily rely on static detection of prompts, outputs, or internal states, often neglecting the dynamic evolution of risk during decoding. This oversight leaves risk signals embedded in decoding trajectories underutilized, constituting a crit

Cheng Liu, Xiaolei Liu, Xingyu Li