Latest AI safety research
Recent papers and discussions on AI governance, LLM security, prompt injection, red teaming, and alignment — refreshed daily from arXiv and HackerNews.
Core views on AI safety (March 2023)
2 points, 0 comments
Claude Opus 4.6 accuracy on BridgeBench hallucination test drops from 83% to 68%
6 points, 1 comments
Show HN: Revdiff – TUI diff reviewer with inline annotations for AI agents
8 points, 2 comments
Show HN: Specsight – Living product specs generated from your codebase
1 points, 0 comments
Show HN: Claudraband – Claude Code for the Power User
56 points, 12 comments
I built an LLM Wiki and RAG solution: here's a demo for a security KB
1 points, 1 comments
Show HN: I visualized how AI agent systems accidentally become org charts
3 points, 0 comments
Strong feeling: we are in a folded AI reality
1 points, 1 comments
Show HN: Two Claudes collaborating through shared memory on a $100 mini-PC
2 points, 0 comments
Mythos Just Proved the Alignment Field Is Building the Wrong Thing
4 points, 2 comments
Show HN: Bal – a Knights and Knaves logic puzzle game with Glicko rating system
1 points, 3 comments
Show HN: Cyber Pulse. AI pipeline for triage and alerting on cyber news/intel
1 points, 0 comments
Nono – Runtime safety infrastructure for AI agents
3 points, 0 comments
EU AI Act compliance layer for Claude Managed Agents (MCP, open source)
3 points, 0 comments
PyTorch Foundation Expands AI Stack with Safetensors, ExecuTorch, and Helion
2 points, 0 comments
Ask HN: Is a purely Markdown-based CRM a terrible idea? Optimized for LLM agents
1 points, 1 comments
Show HN: Zeroclawed: Secure Agent Gateway
8 points, 3 comments
Show HN: LunarGate – a self-hosted OpenAI-compatible LLM gateway
2 points, 0 comments
Show HN: Keeper – embedded secret store for Go (help me break it)
63 points, 33 comments
IPI-Scanner – Detect Indirect Prompt Injection Attacks
2 points, 0 comments
ETCH-X: Robustify Expressive Body Fitting to Clothed Humans with Composable Datasets
Human body fitting, which aligns parametric body models such as SMPL to raw 3D point clouds of clothed humans, serves as a crucial first step for downstream tasks like animation and texturing. An effective fitting method should be both locally expressive-capturing fine details su
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA , a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies
SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds
Robotic manipulation with deformable objects represents a data-intensive regime in embodied learning, where shape, contact, and topology co-evolve in ways that far exceed the variability of rigids. Although simulation promises relief from the cost of real-world data acquisition,
Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts
Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how tr
Differentially Private Language Generation and Identification in the Limit
We initiate the study of language generation in the limit, a model recently introduced by Kleinberg and Mullainathan [KM24], under the constraint of differential privacy. We consider the continual release model, where a generator must eventually output a stream of valid strings w
PIArena: A Platform for Prompt Injection Evaluation
Prompt injection attacks pose serious security risks across a wide range of real-world applications. While receiving increasing attention, the community faces a critical gap: the lack of a unified platform for prompt injection evaluation. This makes it challenging to reliably com
Show HN: AgentMint – Open-source OWASP compliance for AI agent tool calls
5 points, 0 comments
Show HN: BrokenClaw Part 5: GPT-5.4 Edition (Prompt Injection)
9 points, 2 comments
Adversarial Label Invariant Graph Data Augmentations for Out-of-Distribution Generalization
Out-of-distribution (OoD) generalization occurs when representation learning encounters a distribution shift. This occurs frequently in practice when training and testing data come from different environments. Covariate shift is a type of distribution shift that occurs only in th
Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing
In large language model (LLM) agents, reasoning trajectories are treated as reliable internal beliefs for guiding actions and updating memory. However, coherent reasoning can still violate logical or evidential constraints, allowing unsupported beliefs repeatedly stored and propa
AI alignment: the signal is the goal
2 points, 0 comments
Security Concerns in Generative AI Coding Assistants: Insights from Online Discussions on GitHub Copilot
Generative Artificial Intelligence (GenAI) has become a central component of many development tools (e.g., GitHub Copilot) that support software practitioners across multiple programming tasks, including code completion, documentation, and bug detection. However, current research
MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought
Large Language Models (LLMs) still suffer from severe hallucinations and catastrophic forgetting during causal reasoning over massive, fragmented long contexts. Existing memory mechanisms typically treat retrieval as a static, single-step passive matching process, leading to seve
MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning
Medical Vision-Language Models (VLMs) hold immense promise for complex clinical tasks, but their reasoning capabilities are often constrained by text-only paradigms that fail to ground inferences in visual evidence. This limitation not only curtails performance on tasks requiring
LINE: LLM-based Iterative Neuron Explanations for Vision Models
Interpreting the concepts encoded by individual neurons in deep neural networks is a crucial step towards understanding their complex decision-making processes and ensuring AI safety. Despite recent progress in neuron labeling, existing methods often limit the search space to pre
Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation
While Large Language Models (LLMs) have achieved remarkable performance, they remain vulnerable to jailbreak attacks that circumvent safety constraints. Existing strategies, ranging from heuristic prompt engineering to computationally intensive optimization, often face significan
Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection
Existing red-teaming studies on GUI agents have important limitations. Adversarial perturbations typically require white-box access, which is unavailable for commercial systems, while prompt injection is increasingly mitigated by stronger safety alignment. To study robustness und
The Accountability Horizon: An Impossibility Theorem for Governing Human-Agent Collectives
Existing accountability frameworks for AI systems, legal, ethical, and regulatory, rest on a shared assumption: for any consequential outcome, at least one identifiable person had enough involvement and foresight to bear meaningful responsibility. This paper proves that agentic A
Show HN: We Evaluates Medical Research Agent Skills
2 points, 0 comments
TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense
Existing jailbreak defense paradigms primarily rely on static detection of prompts, outputs, or internal states, often neglecting the dynamic evolution of risk during decoding. This oversight leaves risk signals embedded in decoding trajectories underutilized, constituting a crit