Updated daily from arXiv

Discover AI Safety Research

Curated papers on alignment, interpretability, and safety.

1365 Papers Tracked · 10 Categories · Daily Updates
Showing 100 of 1365 papers
Dec 31, 2025
100%

On the geometry and topology of representations: the manifolds of modular addition

Gabriela Moisescu-Pareja, Gavin McCracken, Harley Wiltzer, +4 more

Identifies modular addition representations as topologically equivalent manifolds across different attention architectures. This evidence for universal circuit motifs suggests that mechanistic safety audits and interpretability findings may generalize across model families.

Mech. Interp.
Dec 31, 2025
100%

ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning

Timo Kaufmann, Yannick Metz, Daniel Keim, +1 more

ResponseRank learns preference strength from noisy proxies (e.g., response times) via local ranking, improving reward model sample efficiency and cardinal utility. This enhances RLHF robustness and alignment precision, especially when binary feedback lacks sufficient nuance.

RLHF Alignment Theory
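
To make the preference-strength idea concrete, here is a minimal sketch, assuming a Bradley-Terry reward-model loss in which each pair is weighted by a strength proxy derived from annotator response time. The function name, the exponential time-to-weight mapping, and the numbers are illustrative assumptions, not ResponseRank's local-ranking procedure.

```python
import torch
import torch.nn.functional as F

def strength_weighted_bt_loss(r_chosen, r_rejected, response_times, tau=5.0):
    # Map response time (seconds) to a weight in (0, 1]: quicker decisions are
    # treated as stronger, less ambiguous preferences (a noisy proxy).
    strength = torch.exp(-response_times / tau)
    margin = r_chosen - r_rejected        # reward-model score margin per pair
    per_pair = F.logsigmoid(margin)       # standard Bradley-Terry log-likelihood
    return -(strength * per_pair).mean()

# Three preference pairs: chosen scores, rejected scores, response times.
loss = strength_weighted_bt_loss(
    torch.tensor([1.2, 0.3, 2.0]),
    torch.tensor([0.8, 0.5, -1.0]),
    torch.tensor([2.0, 12.0, 1.5]),
)
```

The point of the sketch is only that stronger, less ambiguous pairs contribute more to the gradient than noisy ones.
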
Dec 31, 2025
100%

Attribution-Guided Distillation of Matryoshka Sparse Autoencoders

Cristina P. Martin-Linares, Jonathan P. Ling

DMSAEs use iterative distillation and gradient-based attribution to isolate a stable core of features across Matryoshka sparsity levels. This improves feature consistency and transferability, addressing SAE instability to enable

Mech. Interp.
Dec 31, 2025
100%

Evaluating the Impact of Compression Techniques on the Robustness of CNNs under Natural Corruptions

Itallo Patrick Castro Alves Da Silva, Emanuel Adler Medeiros Pereira, Erick de Andrade Barboza, +2 more

Evaluates quantization, pruning, and weight clustering on CNN robustness using CIFAR-10/100-C. Finds that specific technique combinations can improve resilience to natural corruptions,

Robustness Evaluations
Dec 31, 2025
100%

MSACL: Multi-Step Actor-Critic Learning with Lyapunov Certificates for Exponentially Stabilizing Control

Yongwei Zhang, Yuanzhe Xing, Quan Quan, +1 more

MSACL integrates exponential stability theory with maximum entropy RL to learn Lyapunov certificates via multi-step Exponential Stability Labels (ESL). This ensures provable stability and rapid convergence, offering a robust foundation for verifiably

Alignment Theory
Dec 31, 2025
100%

Iterative Deployment Improves Planning Skills in LLMs

Augusto B. Corrêa, Yoav Gelberg, Luckeciano C. Melo, +3 more

Iterative deployment with user-curated fine-tuning functions as an implicit RL outer-loop, driving emergent planning capabilities. This poses safety risks as the underlying reward function is undefined and uncontrolled, potentially leading to misaligned model properties.

Alignment Theory
Dec 31, 2025
100%

Towards Provably Secure Generative AI: Reliable Consensus Sampling

Yu Cui, Hang Fu, Sicheng Pan, +9 more

Reliable Consensus Sampling (RCS) provides provable security for generative AI by tracing acceptance probabilities to resist adversarial model manipulation. RCS eliminates abstention and uses dynamic feedback to maintain a controllable risk threshold

Robustness Alignment Theory
Dec 31, 2025
100%

PrivacyBench: A Conversational Benchmark for Evaluating Privacy in Personalized AI

Srija Mukhopadhyay, Sathwik Reddy, Shruthi Muthukumar, +2 more

PrivacyBench evaluates secret preservation in personalized agents via socially grounded multi-turn dialogues. It reveals that RAG architectures create a single point of failure by retrieving sensitive data indiscriminately, leading to leakage in ~2

Evaluations Agent Safety Robustness
Dec 31, 2025
100%

Triangulation as an Acceptance Rule for Multilingual Mechanistic Interpretability

Yanan Long

Triangulation formalizes a causal acceptance rule requiring necessity, sufficiency, and invariance across reference families. It filters spurious circuits via transformation scores over interchange interventions, ensuring mechanistic claims are robust across diverse linguistic...

Mech. Interp.
Dec 31, 2025
100%

Unregularized Linear Convergence in Zero-Sum Game from Preference Feedback

Shulun Chen, Runlong Zhou, Zihan Zhang, +2 more

OMWU provides the first unregularized, last-iterate linear convergence guarantee for Nash Learning from Human Feedback (NLHF). By eliminating regularization bias and the NE uniqueness assumption, it enables more robust alignment with

RLHF Alignment Theory
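
For readers unfamiliar with the algorithm, one standard way to write the Optimistic Multiplicative Weights Update for a zero-sum game with payoff matrix $A$, where $x$ maximizes and $y$ minimizes $x^\top A y$ with step size $\eta$, is sketched below; the NLHF setting replaces these matrix payoffs with preference-based feedback, so this is only the generic update style, not the paper's exact algorithm.

$$x_{t+1}(i) \propto x_t(i)\exp\!\big(\eta\,[\,2(Ay_t)_i-(Ay_{t-1})_i\,]\big),\qquad y_{t+1}(j) \propto y_t(j)\exp\!\big(-\eta\,[\,2(A^\top x_t)_j-(A^\top x_{t-1})_j\,]\big)$$
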
Dec 31, 2025
100%

Projection-based Adversarial Attack using Physics-in-the-Loop Optimization for Monocular Depth Estimation

Takeru Kusakabe, Yudai Hirose, Mashiho Mukaida, +1 more

Employs physics-in-the-loop optimization and CMA-ES to generate projection-based adversarial attacks on monocular depth estimation. By causing objects to "vanish" in depth

Robustness Evaluations
Dec 31, 2025
100%

Sparse Offline Reinforcement Learning with Corruption Robustness

Nam Phuong Tran, Andi Nika, Goran Radanovic, +2 more

Introduces actor-critic methods with sparse robust estimator oracles for high-dimensional offline RL ($N < d$). Provides the first non-vacuous guarantees under single-policy concentrability and adversarial corruption, ensuring robust policy learning despite data poisoning.

Robustness Agent Safety
Dec 31, 2025
100%

Fairness-Aware Insurance Pricing: A Multi-Objective Optimization Approach

Tim J. Boonen, Xinyue Fan, Zixiao Quan

Optimizes the Pareto front of accuracy, group, individual, and counterfactual fairness using NSGA-II for insurance pricing. This multi-objective approach addresses the alignment challenge of mutually exclusive fairness constraints in high

Alignment Theory
Dec 31, 2025
100%

LSRE: Latent Semantic Rule Encoding for Real-Time Semantic Risk Detection in Autonomous Driving

Qian Cheng, Weitao Zhou, Cheng Jing, +5 more

LSRE distills VLM-derived semantic safety rules into a recurrent world model’s latent space, enabling 10Hz risk detection for complex social constraints. This provides a low-latency mechanism for monitoring high-level semantic hazards that are difficult to encode explicitly.

Agent Safety I/O Classifiers
Dec 31, 2025
100%

MUSIC: MUlti-Step Instruction Contrast for Multi-Turn Reward Models

Wenzhe Li, Shujian Zhang, Wenxuan Zhou, +5 more

MUSIC improves multi-turn reward modeling via unsupervised data augmentation that synthesizes contrastive pairs across multiple conversational steps. This enhances RM robustness in complex dialogues, mitigating reward hacking and ensuring stable alignment during multi-turn RLHF.

RLHF Evaluations
Dec 31, 2025
100%

HeteroHBA: A Generative Structure-Manipulating Backdoor Attack on Heterogeneous Graphs

Honglin Gao, Lan Zhao, Junhao Ren, +2 more

HeteroHBA employs saliency-based screening and AdaIN/MMD feature alignment to inject generative backdoors into HGNNs. Its bilevel optimization ensures high success and stealth even against structural defenses, exposing critical security risks in heterogeneous graph learning.

Robustness
Dec 31, 2025
100%

Do Large Language Models Know What They Are Capable Of?

Casey O. Barkan, Sid Black, Oliver Sourbut

Frontier LLMs exhibit persistent overconfidence that worsens during multi-step agentic tasks, regardless of model size or reasoning capabilities. This miscalibration drives suboptimal decision-making, highlighting a critical

Evaluations Agent Safety
Dec 31, 2025
100%

SliceLens: Fine-Grained and Grounded Error Slice Discovery for Multi-Instance Vision Tasks

Wei Zhang, Chaoqun Wang, Zixuan Guan, +4 more

SliceLens leverages LLMs and VLMs to discover fine-grained error slices in multi-instance vision tasks via grounded hypothesis generation. It identifies systematic failure modes in complex scenarios, enabling targeted model repair and more robust safety evaluations.

Evaluations Robustness
Dec 31, 2025
100%

MultiRisk: Multiple Risk Control via Iterative Score Thresholding

Sunay Joshi, Yan Sun, Hamed Hassani, +1 more

MultiRisk provides a dynamic programming framework for test-time filtering that enforces multiple prioritized risk constraints. By leveraging data exchangeability, it offers finite-sample guarantees for simultaneous, nearly tight risk control

I/O Classifiers AI Control Alignment Theory
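
As a rough illustration of iterative score thresholding under prioritized constraints, the toy routine below raises a single filter threshold over a calibration set until each empirical risk meets its budget, in priority order. The names and stopping rule are assumptions, and it omits the finite-sample corrections that give MultiRisk its guarantees.

```python
import numpy as np

def iterative_thresholding(scores, losses, budgets):
    """scores: (n,) filter scores (higher = keep); losses: (k, n) per-example
    losses for k prioritized risks; budgets: (k,) targets in priority order.
    Returns a threshold t so that each risk of the kept set {score >= t}
    is within budget on the calibration data. Illustrative only."""
    grid = np.sort(np.unique(scores))
    t = grid[0]
    for risk, budget in zip(losses, budgets):
        # Tighten the threshold until this risk's empirical level is acceptable.
        for cand in grid[grid >= t]:
            kept = scores >= cand
            if kept.sum() == 0 or risk[kept].mean() <= budget:
                t = cand
                break
    return t
```
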
Dec 31, 2025
100%

CPR: Causal Physiological Representation Learning for Robust ECG Analysis under Distribution Shifts

Shunbo Jia, Caizhi Liao

CPR uses a Structural Causal Model (SCM) to disentangle invariant pathological features from non-causal artifacts. By enforcing physiological priors, it matches certified robustness against smooth adversarial perturbations with single-

Robustness
Dec 31, 2025
100%

HaluNet: Multi-Granular Uncertainty Modeling for Efficient Hallucination Detection in LLM Question Answering

Chaodong Tong, Qi Zhang, Jiayang Gao, +3 more

HaluNet improves LLM truthfulness via a multi-branch architecture that fuses token-level probabilities, distributional uncertainty, and internal semantic embeddings for one-pass hallucination detection. This multi-granular approach

Alignment Theory
Dec 31, 2025
100%

Localized Calibrated Uncertainty in Code Language Models

David Gros, Prem Devanbu

Localizes intent misalignment in LLM code via white-box probing for calibrated uncertainty on arbitrary spans. Small supervisor models achieve low calibration error, enabling scalable oversight and error detection that generalizes from code to natural

Alignment Theory
Dec 31, 2025
100%

Safe in the Future, Dangerous in the Past: Dissecting Temporal and Linguistic Vulnerabilities in LLMs

Muhammad Abdullahi Said, Muhammad Sammani Sani

Identifies "Temporal Asymmetry" where past-tense framing bypasses LLM safety filters (15.6% safe) vs. future-tense (57.2%). Using HausaSafety

Evaluations Robustness
Dec 30, 2025
100%

Foundation models on the bridge: Semantic hazard detection and safety maneuvers for maritime autonomy with vision-language models

Kim Alexander Christensen, Andreas Gudahl Tufte, Alexey Gusev, +5 more

Semantic Lookout uses VLMs to detect semantic hazards and select cautious fallback maneuvers from candidate-constrained, world-anchored trajectories. This fast-slow pipeline enables safe OOD handling and human handover, outperforming

Agent Safety AI Control Evaluations
Dec 30, 2025
100%

Towards mechanistic understanding in a data-driven weather model: internal activations reveal interpretable physical features

Theodore MacMillan, Nicholas T. Ouellette

Causal control via interventions enables auditing of black-box physics models.

Mech. Interp.
Dec 30, 2025
100%

Language Model Agents Under Attack: A Cross Model-Benchmark of Profit-Seeking Behaviors in Customer Service

Jingyu Zhang

Establishes a cross-domain benchmark for profit-seeking direct prompt injection in LLM agents. It quantifies how techniques like payload splitting induce unauthorized financial concessions, exposing critical robustness gaps in policy-bound agentic workflows and oversight.

Robustness Evaluations Agent Safety
Dec 30, 2025
100%

Efficient Inference for Inverse Reinforcement Learning and Dynamic Discrete Choice Models

Lars van der Laan, Aurelien Bibaut, Nathan Kallus

Derives efficient influence functions for reward functionals in MaxEnt IRL, enabling $\sqrt{n}$-consistent, debiased inference with flexible ML. This provides the statistical guarantees necessary for robustly recovering

Alignment Theory RLHF
Dec 30, 2025
100%

Constrained Language Model Policy Optimization via Risk-aware Stepwise Alignment

Lijun Zhang, Lin Li, Wei Wei, +5 more

RSA utilizes nested risk measures for token-level constrained policy optimization, addressing the limitations of risk-neutral alignment. It explicitly suppresses low-probability, high-impact harmful behaviors (tail risks) and prevents excessive model shift during fine-tuning.

RLHF Alignment Theory Robustness
Dec 30, 2025
100%

Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning

Chubin Chen, Sujie Hu, Jiashu Zhu, +8 more

Mitigates reward hacking in diffusion RLHF via $D^2$-Align, which applies a directional correction in the reward embedding space to prevent Preference Mode Collapse. This ensures alignment remains robust to

RLHF Evaluations Alignment Theory
Dec 30, 2025
100%

Activation Steering for Masked Diffusion Language Models

Adi Shnaidman, Erin Feiglin, Osher Yaari, +3 more

Enables inference-time control of Masked Diffusion Language Models (MDLMs) via contrastive activation steering. Applying layer-wise vectors during iterative denoising provides a training-free mechanism to

Mech. Interp. AI Control
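
A minimal sketch of contrastive activation steering in PyTorch, assuming the steered layer returns a plain (batch, seq, hidden) tensor; real MDLM blocks may return tuples, and the layer choice, the scale alpha, and the hook mechanics here are illustrative rather than the paper's recipe.

```python
import torch

def contrastive_steering_vector(acts_pos, acts_neg):
    """Steering vector for one layer: difference of mean activations collected
    on prompts that do vs. do not exhibit the target behavior."""
    return acts_pos.mean(dim=0) - acts_neg.mean(dim=0)

def add_steering_hook(layer, vector, alpha=4.0):
    """Add alpha * vector to the layer's output; while registered, the shift
    is applied at every iterative denoising step."""
    def hook(_module, _inputs, output):
        return output + alpha * vector.to(output.dtype)
    return layer.register_forward_hook(hook)  # call .remove() on the handle to undo
```
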
Dec 30, 2025
100%

GARDO: Reinforcing Diffusion Models without Reward Hacking

Haoran He, Yuxiao Ye, Jie Liu, +7 more

GARDO mitigates reward hacking in diffusion RL via gated regularization of high-uncertainty samples and an adaptive reference policy. It prevents mode collapse by amplifying rewards for diverse outputs, ensuring robust alignment even when proxy reward functions are misspecified.

RLHF Alignment Theory
Dec 30, 2025
100%

Guided Diffusion-based Generation of Adversarial Objects for Real-World Monocular Depth Estimation Attacks

Yongtao Chen, Yanbo Wang, Wentao Zhao, +3 more

Generates scene-consistent adversarial objects for Monocular Depth Estimation using diffusion models and JVP Guidance. This exposes critical vulnerabilities in autonomous driving perception to physically plausible threats, highlighting risks that traditional patch attacks miss

Robustness Evaluations Agent Safety
Dec 30, 2025
100%

Beyond Hallucinations: A Composite Score for Measuring Reliability in Open-Source Large Language Models

Rohit Kumar Salla, Manoj Saravanan, Shrikar Reddy Kota

The Composite Reliability Score (CRS) integrates calibration, robustness, and uncertainty quantification into a unified metric. This framework identifies hidden failure modes missed by isolated evaluations, ensuring LLMs maintain safety under perturbations and

Evaluations Robustness
Dec 30, 2025
100%

AHA: Aligning Large Audio-Language Models for Reasoning Hallucinations via Counterfactual Hard Negatives

Yanxi Chen, Wenhui Zhu, Xiwen Chen, +9 more

AHA mitigates LALM hallucinations using counterfactual hard negative mining to create preference data. By forcing models to distinguish acoustic evidence from plausible linguistic fabrications, it improves temporal reasoning and grounding, enhancing multimodal reliability.

RLHF Evaluations
Dec 30, 2025
100%

Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race?

Yuan Xin, Dingfan Chen, Linyi Yang, +2 more

Systematically evaluates jailbreak attacks against the full inference pipeline, including I/O safety filters. Finds that standalone LLM assessments overestimate risk, as filters catch most attacks. Highlights the efficacy of

Robustness I/O Classifiers Evaluations
Dec 30, 2025
100%

RepetitionCurse: Measuring and Understanding Router Imbalance in Mixture-of-Experts LLMs under DoS Stress

Ruixuan Huang, Qingyue Wang, Hantao Huang, +4 more

RepetitionCurse exposes a DoS vulnerability in MoE models where repetitive token patterns trigger extreme router imbalance. By concentrating tokens into the same top-k experts, it creates bottlenecks in expert-parallel systems

Robustness Evaluations
Dec 30, 2025
100%

Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process

Zhenyu Zhang, Shujian Zhang, John Lambert, +6 more

E" is the tool. "Reasoning vectors" is the concept. * Let's try to make it punchier. "RISE applies Sparse Autoencoders to sentence-

Mech. Interp.
Dec 30, 2025
100%

Assured Autonomy: How Operations Research Powers and Orchestrates Generative AI Systems

Tinglong Dai, David Simchi-Levi, Michelle Xiao Wu, +1 more

Integrates Operations Research into agentic GenAI via flow-based ODEs for auditable, constraint-aware generation and adversarial robustness using uncertainty sets. This framework ensures verifiable feasibility and tail-risk discipline in high-consequence autonomous workflows.

Agent Safety AI Control Robustness Position Paper Alignment Theory
Dec 30, 2025
100%

T2VAttack: Adversarial Attack on Text-to-Video Diffusion Models

Changzhen Li, Yuecong Min, Jie Zhang, +3 more

T2VAttack introduces semantic and temporal adversarial objectives to evaluate T2V diffusion robustness. By optimizing prompt perturbations via word replacement and insertion, it reveals that single-word changes significantly degrade video-text alignment

Robustness Evaluations
Dec 30, 2025
100%

Statistical Guarantees in the Search for Less Discriminatory Algorithms

Chris Hays, Ben Laufer, Solon Barocas, +1 more

Formalizes the search for less discriminatory algorithms (LDAs) as an optimal stopping problem, providing an adaptive algorithm that yields high-probability upper bounds on potential fairness gains. This enables verifiable certification of "

Governance
Dec 29, 2025
100%

Breaking Audio Large Language Models by Attacking Only the Encoder: A Universal Targeted Latent-Space Audio Attack

Roee Ziv, Raz Lapid, Moshe Sipper

Introduces a universal targeted latent-space attack on audio encoders that forces specific LLM outputs across diverse speakers without LLM access. This reveals a critical, transferable vulnerability where encoder-level perturbations

Robustness Evaluations
Dec 29, 2025
100%

The Drill-Down and Fabricate Test (DDFT): A Protocol for Measuring Epistemic Robustness in Language Models

Rahul Baxi

DDFT quantifies epistemic robustness by testing factual consistency under semantic compression and adversarial fabrication. It identifies error detection as the primary bottleneck for reliability, showing that scale is orthogonal to a model's ability to verify its own knowledge.

Evaluations Robustness
Dec 29, 2025
100%

Adversarial Lens: Exploiting Attention Layers to Generate Adversarial Examples for Evaluation

Kaustubh Dhole

Exploits intermediate attention layer token distributions to generate adversarial examples from internal model hypotheses. This mechanistic approach identifies vulnerabilities in LLM-based evaluators, highlighting risks to the reliability of automated safety

Mech. Interp. Robustness Evaluations
Dec 29, 2025
100%

Improved Bounds for Private and Robust Alignment

Wenqian Weng, Yi He, Xingyu Zhou

Establishes near-optimal suboptimality bounds for RLHF under differential privacy and adversarial corruption. Proves log loss MLE optimality for private alignment and provides the first guarantees for private, robust online

RLHF Alignment Theory Robustness
Dec 29, 2025
100%

Zero-Trust Agentic Federated Learning for Secure IIoT Defense Systems

Samaresh Kumar Singh, Joyjit Roy, Martin So

ZTA-FL secures IIoT via TPM-based attestation and SHAP-weighted aggregation for explainable Byzantine detection. By integrating on-device adversarial training, it maintains 9

Robustness Agent Safety
Dec 29, 2025
100%

Eliciting Behaviors in Multi-Turn Conversations

Jing Huang, Shujian Zhang, Lun Wang, +3 more

Introduces a generalized multi-turn formulation for online behavior elicitation, achieving up to 77% success in finding failure cases where static benchmarks fail. This highlights the necessity of dynamic, multi-turn red teaming for robust safety evaluation of LLMs.

Evaluations Robustness
Dec 29, 2025
100%

Fine-Tuning LLMs with Fine-Grained Human Feedback on Text Spans

Sky CH-Wang, Justin Svegliato, Helen Appel, +1 more

Leverages span-level feedback to create incremental improvement chains for direct alignment. Training on localized edits rather than coarse rankings enables more precise control over model behavior and more efficient learning of nuanced safety and correctness

RLHF
Dec 29, 2025
100%

Multilingual Hidden Prompt Injection Attacks on LLM-Based Academic Reviewing

Panagiotis Theocharopoulos, Ajinkya Kulkarni, Mathew Magimai.-Doss

Demonstrates cross-lingual vulnerability to hidden prompt injections in LLM-based peer review. English, Japanese, and Chinese prompts successfully manipulate scores, while Arabic fails, revealing language-dependent robustness gaps in

Robustness Evaluations
Dec 29, 2025
100%

ProGuard: Towards Proactive Multimodal Safeguard

Shaohan Yu, Lijun Li, Chenyang Si, +2 more

ProGuard employs RL and a synonym-bank similarity reward to identify and describe out-of-distribution (OOD) multimodal risks. By leveraging a hierarchical safety taxonomy, it improves OOD detection

I/O Classifiers Evaluations Robustness RLHF
Dec 29, 2025
100%

Toward Trustworthy Agentic AI: A Multimodal Framework for Preventing Prompt Injection Attacks

Toqeer Ali Syed, Mishal Ateeq Almutairi, Mahmoud Abdel Moaty

Introduces a Cross-Agent Multimodal Provenance-Aware Defense Framework using text/visual sanitizers and a provenance ledger to track trust levels. It prevents prompt injection propagation across multi-agent graphs, securing

Agent Safety Robustness I/O Classifiers
Dec 29, 2025
100%

Lie to Me: Knowledge Graphs for Robust Hallucination Self-Detection in LLMs

Sahil Kale, Antonio Luca Alfeo

Improves hallucination self-detection by mapping LLM outputs to structured knowledge graphs. This entity-relation decomposition enables robust verification of atomic facts, achieving a 16% accuracy boost over Self

I/O Classifiers Evaluations
Dec 29, 2025
100%

PurifyGen: A Risk-Discrimination and Semantic-Purification Model for Safe Text-to-Image Generation

Zongsheng Cao, Yangfan He, Anran Liu, +3 more

PurifyGen achieves training-free T2I safety by projecting risky token embeddings into the null space of toxic concept matrices and the range space of clean concepts. This dual-space transformation neutralizes

I/O Classifiers Robustness
Dec 29, 2025
100%

Trustworthy Machine Learning under Distribution Shifts

Zhuo Huang

Systematizes trustworthy ML by mapping perturbation, domain, and modality shifts against robustness, explainability, and adaptability. This framework addresses safety risks from distribution shifts, ensuring models maintain reliability and alignment in O

Robustness
Dec 29, 2025
100%

Why AI Safety Requires Uncertainty, Incomplete Preferences, and Non-Archimedean Utilities

Alessio Benavoli, Alessandro Facchini, Marco Zaffalon

Proves that solving assistance and shutdown games requires non-Archimedean utilities and incomplete preferences. These formalisms prevent instrumental convergence toward power-seeking and ensure agents remain corrigible by prioritizing shutdown over task completion.

Alignment Theory AI Control Agent Safety Position Paper
Dec 29, 2025
100%

Prompt-Induced Over-Generation as Denial-of-Service: A Black-Box Attack-Side Benchmark

Manu, Yi Guo, Jo Plested, +5 more

Benchmarks black-box DoS attacks using EOGen and RL-GOAL to suppress EOS tokens, inducing over-generation up to 2.81x context length. This quantifies availability risks

Robustness Evaluations
Dec 29, 2025
100%

Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance

Zhuo Li, Pengyu Cheng, Zhechao Yu, +7 more

Leveraging Information Bottleneck principles, DIR maximizes mutual information between RM scores and preferences while minimizing it with biased attributes. This mitigates non-linear biases like length and sycophancy,

RLHF Alignment Theory
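
Read as an information-bottleneck objective, the idea in the summary can be written roughly as below, where $r_\theta$ is the reward model, $y^{\mathrm{pref}}$ the preference label, $a$ a biased attribute such as response length, and $\beta$ a trade-off weight; the paper's exact estimator and notation may differ.

$$\max_{\theta}\; I\!\big(r_\theta(x,y);\,y^{\mathrm{pref}}\big)\;-\;\beta\, I\!\big(r_\theta(x,y);\,a(x,y)\big)$$
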
Dec 29, 2025
100%

Replay Failures as Successes: Sample-Efficient Reinforcement Learning for Instruction Following

Kongcheng Zhang, Qi Yao, Shunyu Liu, +7 more

HiR addresses sparse rewards in RLHF by relabeling failed attempts as successes based on satisfied constraints. This dual-preference learning framework uses a select-then-rewrite strategy to enable efficient alignment with complex safety constraints using only binary feedback.

RLHF
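
The relabeling step can be pictured with a short, hypothetical helper: a failed response is kept as a positive example for the subset of constraints it did satisfy. The verifier check(...) and the rewriting template are placeholders, not the paper's select-then-rewrite implementation.

```python
def hindsight_relabel(constraints, response, check):
    """constraints: list of constraint strings from the original instruction.
    check(constraint, response) -> bool is a hypothetical verifier.
    Returns (relabeled_instruction, response), or None if nothing was satisfied."""
    satisfied = [c for c in constraints if check(c, response)]
    if not satisfied:
        return None
    relabeled = "Write a response that satisfies: " + "; ".join(satisfied)
    return relabeled, response
```
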
Dec 29, 2025
100%

C2PO: Diagnosing and Disentangling Bias Shortcuts in LLMs

Xuan Feng, Bo An, Tianlong Gu, +4 more

C2PO introduces a causal-contrastive preference optimization framework that uses counterfactual signals to isolate and suppress logit-level shortcut features. It mitigates both stereotypical and structural biases by disentangling spurious correlations from valid reasoning paths.

Alignment Theory
Dec 29, 2025
100%

Direct Diffusion Score Preference Optimization via Stepwise Contrastive Policy-Pair Supervision

Dohyun Kim, Seungwoo Lyu, Seung Wook Kim, +1 more

DDSPO optimizes diffusion models via dense, per-timestep score-space supervision, contrasting winning and losing policy trajectories. This stepwise alignment improves robustness and intent adherence, bypassing the noise of sparse rewards and manual preference labeling.

RLHF
Dec 29, 2025
100%

Securing the AI Supply Chain: What Can We Learn From Developer-Reported Security Issues and Solutions of AI Projects?

The Anh Nguyen, Triet Huynh Minh Le, M. Ali Babar

Maps the AI supply chain security landscape by analyzing 312k+ developer discussions via a distilBERT pipeline. It identifies 32 issue types, revealing that model and data-centric vulnerabilities

Robustness
Dec 29, 2025
100%

Explainable Neural Inverse Kinematics for Obstacle-Aware Robotic Manipulation: A Comparative Analysis of IKNet Variants

Sheng-Kai Chen, Yi-Ling Tsai, Chun-Chih Chang, +2 more

Integrates SHAP-based attribution with neural inverse kinematics to audit obstacle-avoidance reliability. By correlating feature importance with physical safety margins, it identifies that balanced attribution reduces collision risk, enabling transparent, safety-aligned manipulation.

Alignment Theory
Dec 29, 2025
100%

RobustMask: Certified Robustness against Adversarial Neural Ranking Attack via Randomized Masking

Jiawei Liu, Zhuo Chen, Rui Zhu, +4 more

RobustMask achieves certified top-K robustness for neural ranking models via randomized masking and LM-based context prediction. It leverages pairwise comparisons and probabilistic smoothing to secure RAG systems against adversarial document promotion

Robustness
Dec 29, 2025
100%

Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control

Yoonpyo Lee, Kazuma Kobayashi, Sai Puppala, +4 more

Replaces perceptual inference with physics-based validation for nuclear control. Scaling induces a 500x variance collapse and autonomous rejection of 70% of training data, mitigating catastrophic tail risks by prioritizing outcome-space guarantees over parameter-space imitation.

Alignment Theory
Dec 29, 2025
100%

Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation

Dianyun Wang, Qingsen Ma, Yuhu Shang, +5 more

Leverages Sparse Autoencoders (SAEs) to construct interpretable low-rank subspaces for safety alignment, mitigating polysemanticity in weight updates. Achieves 99.6% safety with <0.25% parameters, grounding alignment in disentangled features for improved transparency and control.

Alignment Theory
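
One way to picture SAE-constructed low-rank subspace adaptation is an additive adapter whose update is confined to the span of a few frozen SAE decoder directions, as in the illustrative module below; the shapes, names, and the way the frozen base output is passed in are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SAESubspaceAdapter(nn.Module):
    """Additive adapter whose weight update lives in the span of selected,
    frozen SAE decoder directions (an interpretable feature subspace)."""

    def __init__(self, d_in, sae_directions):
        super().__init__()
        # sae_directions: (r, d_out) matrix of chosen SAE decoder rows, frozen.
        self.register_buffer("U", sae_directions)
        self.B = nn.Parameter(torch.zeros(sae_directions.shape[0], d_in))

    def forward(self, x, base_out):
        # x: (..., d_in); base_out: frozen base layer output, (..., d_out).
        return base_out + (x @ self.B.T) @ self.U
```
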
Dec 29, 2025
100%

Safety-Biased Policy Optimisation: Towards Hard-Constrained Reinforcement Learning via Trust Regions

Ankit Kanwar, Dominik Wagner, Luke Ong

SB-TRPO achieves hard-constrained RL by biasing trust-region updates using a convex combination of reward and cost natural policy gradients. It guarantees a fixed fraction of cost reduction per step, enabling near

Alignment Theory
Dec 29, 2025
100%

Uncovering Discrimination Clusters: Quantifying and Explaining Systematic Fairness Violations

Ranit Debnath Akash, Ashish Kumar, Verya Monjezi, +4 more

HyFair identifies "discrimination clusters" where protected attribute perturbations yield $k$ distinct outcome groups, exposing systematic biases missed by pairwise checks. It uses SMT/MILP and randomized search to certify

Alignment Theory
Dec 29, 2025
100%

EquaCode: A Multi-Strategy Jailbreak Approach for Large Language Models via Equation Solving and Code Completion

Zhen Liang, Hai Huang, Zhengkui Chen

EquaCode bypasses LLM safety filters by encoding malicious intent into mathematical equations solved via code completion. This multi-strategy approach exploits cross-domain reasoning to divert attention from safety constraints, achieving a

Robustness Evaluations
Dec 29, 2025
100%

Multi-Agent Framework for Threat Mitigation and Resilience in AI-Based Systems

Armstrong Foundjem, Lionel Nganyewou Tidjon, Leuson Da Silva, +1 more

Maps 93 ML threats via a multi-agent RAG framework, identifying novel risks like preference-guided jailbreaks and API model stealing. This ontology-driven graph links TTPs to library

Robustness RLHF Evaluations
Dec 29, 2025
100%

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

Karolina Korgul, Yushi Yang, Arkadiusz Drohomirecki, +7 more

TRAP benchmarks LLM web agent robustness against task-redirecting prompt injections via high-fidelity website clones. It reveals systemic vulnerabilities to social engineering, with frontier models failing 25%

Evaluations Robustness Agent Safety
Dec 29, 2025
100%

InSPO: Unlocking Intrinsic Self-Reflection for LLM Preference Optimization

Yu Li, Tian Lan, Zhengling Qi

InSPO derives a globally optimal policy conditioned on both context and alternative responses, ensuring invariance to arbitrary scalarization and reference choices. This "intrinsic self-reflection" improves alignment robustness by leveraging comparative data during training.

RLHF Alignment Theory
Dec 28, 2025
100%

Benchmark Success, Clinical Failure: When Reinforcement Learning Optimizes for Benchmarks, Not Patients

Armin Berger, Manuela Bergau, Helen Schneider, +7 more

GRPO improves in-distribution performance but degrades cross-dataset transferability by 19% in medical VLMs. This "generalization paradox" shows RL-driven reasoning overfits to benchmark features,

RLHF Robustness Evaluations
Dec 28, 2025
100%

The Reward Model Selection Crisis in Personalized Alignment

Fady Rezk, Yuangang Pan, Chuan-Sheng Foo, +4 more

RM accuracy fails to predict behavioral alignment in reward-guided decoding, showing weak correlation (τ=0.08-0.31) with policy discrimination. Pref-LaMP reveals a decoupling between

RLHF Evaluations Alignment Theory
Dec 28, 2025
100%

Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization

Kerem Zaman, Shashank Srivastava

Rebuts the Biasing Features metric, showing CoT "unfaithfulness" is often lossy compression rather than deception. Causal Mediation Analysis proves non-verbalized hints still caus

Evaluations Mech. Interp.
Dec 28, 2025
100%

APO: Alpha-Divergence Preference Optimization

Wang Zixian

APO uses Csiszar alpha-divergence to interpolate between forward and reverse KL in an anchored geometry. A confidence-guarded $\alpha$ schedule enables a stable transition from mode-covering to mode-seeking behavior.

RLHF Alignment Theory
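
For reference, one common parameterization of the Csiszar (Amari) alpha-divergence is given below; it reduces to $\mathrm{KL}(q\|p)$ as $\alpha\to 0$ and to $\mathrm{KL}(p\|q)$ as $\alpha\to 1$, which is the interpolation between the two KL directions the summary refers to. The paper's own convention for $\alpha$ and for which distribution is anchored may differ.

$$D_\alpha(p\,\|\,q)\;=\;\frac{1}{\alpha(\alpha-1)}\left(\int p(x)^{\alpha}\,q(x)^{1-\alpha}\,dx\;-\;1\right)$$
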
Dec 28, 2025
100%

DECEPTICON: How Dark Patterns Manipulate Web Agents

Phil Cuvin, Hao Zhu, Diyi Yang

DECEPTICON benchmarks agent susceptibility to dark patterns, finding SOTA models are manipulated into malicious outcomes in >70% of tasks. Susceptibility scales with model size and reasoning, highlighting a critical, unmitigated vulnerability in agentic instruction-following.

Agent Safety Evaluations Robustness
Dec 28, 2025
100%

M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models

Ju-Hsuan Weng, Jia-Wei Liao, Cheng-Fu Chou, +1 more

M-ErasureBench exposes that concept erasure fails against non-text modalities like inverted latents. IRECE mitigates these bypass attacks by using cross-attention to localize and perturb target concepts

Evaluations Robustness
Dec 28, 2025
100%

Let Samples Speak: Mitigating Spurious Correlation by Exploiting the Clusterness of Samples

Weiwei Li, Junzhuo Liu, Yuanyuan Ren, +3 more

Identifies spurious correlations by detecting dispersed feature distributions, bypassing the need for attribute labels. A neutralization-alignment pipeline eliminates these shortcuts, improving worst-group accuracy by 20% and enhancing robustness

Robustness
Dec 26, 2025
100%

HalluMat: Detecting Hallucinations in LLM-Generated Materials Science Content Through Multi-Stage Verification

Bhanu Prakash Vangala, Sajid Mahmud, Pawan Neupane, +2 more

HalluMat reduces materials science hallucinations by 30% via a multi-stage detector using contradiction graph analysis and multi-source retrieval. It introduces the PHCS metric to quantify reliability through consistency

Evaluations I/O Classifiers Robustness
Dec 26, 2025
100%

Scaling Adversarial Training via Data Selection

Youran Ye, Dejin Wang, Ajinkya Bhandare

Selective Adversarial Training (SAT) scales robustness by applying PGD only to critical samples via margin-based or gradient-matching selection. Reducing adversarial overhead by 50% without sacrificing performance addresses the

Robustness
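
A hedged sketch of the selection-then-attack loop: a margin-based selector picks the fraction of each batch that sits closest to the decision boundary, and only those samples get the standard (and expensive) PGD treatment. Function names, the margin criterion, and hyperparameters are illustrative; the gradient-matching selector mentioned in the summary is not shown.

```python
import torch
import torch.nn.functional as F

def select_low_margin(model, x, y, frac=0.3):
    """Indices of the frac of samples with the smallest logit margin
    (correct-class logit minus best competing logit)."""
    with torch.no_grad():
        logits = model(x)
        correct = logits.gather(1, y.unsqueeze(1)).squeeze(1)
        runner_up = logits.scatter(1, y.unsqueeze(1), float("-inf")).max(dim=1).values
        margin = correct - runner_up
    k = max(1, int(frac * len(x)))
    return margin.topk(k, largest=False).indices

def pgd(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Standard untargeted L-infinity PGD; assumes x does not require grad."""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = (x_adv + alpha * grad.sign()).detach()
        x_adv = torch.clamp(torch.min(torch.max(x_adv, x - eps), x + eps), 0, 1)
    return x_adv
```

In training, only the selected indices in each batch would be replaced by their adversarial versions before the usual cross-entropy update.
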
Dec 26, 2025
100%

Toward Secure and Compliant AI: Organizational Standards and Protocols for NLP Model Lifecycle Management

Sunil Arora, John Hastings

SC-NLP-LMF operationalizes NIST AI RMF and ISO 42001 via a six-phase NLP lifecycle. It mitigates safety risks through integrated differential privacy, federated learning

Governance Robustness Position Paper
Dec 26, 2025
100%

Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models

Zongmin Zhang, Zhen Sun, Yifan Liao, +5 more

BadVSFM introduces a two-stage backdoor attack for VSFMs, steering image encoders and mask decoders to force target masks across prompt types. By decoupling triggered and clean representations, it bypasses

Robustness
Dec 26, 2025
100%

LVLM-Aided Alignment of Task-Specific Vision Models

Alexander Koebler, Lukas Kuhn, Ingo Thon, +1 more

LVLM-VA uses LVLMs as a bidirectional interface to align small vision models with human domain knowledge. By mapping class-level specs to image-level critiques, it mitigates spurious

Alignment Theory Robustness RLHF
Dec 26, 2025
100%

Optimistic Feasible Search for Closed-Loop Fair Threshold Decision-Making

Wenzhang Du

Optimistic Feasible Search (OFS) mitigates feedback-induced disparities in closed-loop systems. By using confidence bounds for optimistic feasibility, it maintains demographic parity under bandit feedback, preventing the

Alignment Theory
Dec 26, 2025
100%

Perceive and Calibrate: Analyzing and Enhancing Robustness of Medical Multi-Modal Large Language Models

Dunyuan XU, Xikai Yang, Yaoqian Li, +3 more

Bolsters medical MLLM robustness via the training-free IMC framework, using prototype-guided feature calibration for visual artifacts and a multi-agent system for text denoising. This mitigates safety-critical failures caused by real-world clinical input perturbations.

Robustness Evaluations
Dec 26, 2025
100%

Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model

Nathan Kallus

SPO mitigates reward misspecification by modeling preference data as a single-index model, making alignment agnostic to the link function relating rewards to preference probabilities.

RLHF Alignment Theory
Dec 26, 2025
100%

Beyond Single Bugs: Benchmarking Large Language Models for Multi-Vulnerability Detection

Chinmay Pushkar, Sanchit Kabra, Dhruv Kumar, +1 more

Benchmarks LLMs on multi-vulnerability detection in long-context code (10k tokens), revealing "count bias" and a 40% F1 drop as density increases. This

Alignment Theory
Dec 26, 2025
100%

Bridging the Copyright Gap: Do Large Vision-Language Models Recognize and Respect Copyrighted Content?

Naen Xu, Jinghuai Zhang, Changjiang Li, +7 more

Introduces a 50k-pair multimodal benchmark evaluating LVLM copyright compliance. SOTA models fail to respect visual copyright notices, risking infringement. A novel tool-augmented defense framework is proposed to

Alignment Theory
Dec 26, 2025
100%

A Comedy of Estimators: On KL Regularization in RL Training of LLMs

Vedant Shah, Johan Obando-Ceron, Vineet Jain, +10 more

Reveals that common KL estimators yield biased gradients, causing RLHF instability. Proves that unbiased configurations improve OOD performance and alignment robustness, providing a more reliable mechanism to constrain LLMs to safe reference

RLHF Alignment Theory
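
The estimators at issue are the familiar k1/k2/k3 family for Monte Carlo KL from policy samples, sketched below. Note that an estimator being unbiased for the KL value does not by itself make the gradient obtained by backpropagating through it unbiased, which is the distinction the paper examines; variable names are illustrative.

```python
import torch

def kl_estimators(logp_policy, logp_ref):
    """Per-token estimators of KL(policy || ref) from tokens sampled under the
    policy. log_ratio = log ref - log policy at the sampled tokens."""
    log_ratio = logp_ref - logp_policy
    k1 = -log_ratio                            # unbiased value, high variance
    k2 = 0.5 * log_ratio ** 2                  # biased value, low variance
    k3 = torch.exp(log_ratio) - 1 - log_ratio  # unbiased value, low variance
    return k1, k2, k3
```
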
Dec 26, 2025
100%

Analyzing Code Injection Attacks on LLM-based Multi-Agent Systems in Software Development

Brian Bowers, Smita Khapre, Jugal Kalita

Evaluates code injection in multi-agent systems, finding that poisonous few-shot examples can bypass security agents to increase attack success from 0% to 71.95%. This reveals critical robustness

Agent Safety Robustness Evaluations
Dec 26, 2025
100%

Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models

Mengqi He, Xinyu Tian, Xin Shen, +4 more

EGA targets the ~20% of high-entropy tokens that act as critical decision forks in VLM generation. This selective attack achieves 35-49% harmful conversion rates and transfers across

Robustness Evaluations
Dec 26, 2025
100%

On The Conceptualization and Societal Impact of Cross-Cultural Bias

Vitthal Bhandari

Establishes a framework for conceptualizing cross-cultural bias and evaluating societal harms by synthesizing 2025 NLP research. It mandates stakeholder-integrated metrics to mitigate representational misalignment, ensuring

Position Paper Evaluations
Dec 25, 2025
100%

Learning from Negative Examples: Why Warning-Framed Training Data Teaches What It Warns Against

Tsogt-Ochir Enkhbayar

SAE analysis reveals warning-framed data fails because "describing X" and "performing X" share non-orthogonal latent features. This "stealth slip" bypasses linear probes,

Mech. Interp. Alignment Theory Robustness
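
The non-orthogonality claim can be probed with something as simple as the cosine similarity between two SAE decoder directions, as in the hypothetical snippet below; in practice the feature indices would come from inspecting max-activating examples, and nothing here reproduces the paper's analysis.

```python
import torch.nn.functional as F

def feature_overlap(decoder_weights, idx_describe, idx_perform):
    """Cosine similarity between the decoder direction of a feature active when
    a behavior is described and one active when it is performed; values far
    from 0 indicate the shared, non-orthogonal structure the summary refers to."""
    d1 = decoder_weights[idx_describe]
    d2 = decoder_weights[idx_perform]
    return F.cosine_similarity(d1, d2, dim=0).item()
```
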
Dec 25, 2025
100%

A Model of Causal Explanation on Neural Networks for Tabular Data

Takashi Isozaki, Masahiro Yamamoto, Atsushi Noda

CENNET integrates Structural Causal Models (SCMs) with neural networks to provide causal explanations for tabular data. Using an entropy-based index to distinguish causal drivers from pseudo-correlations, it mitigates safety risks arising from reliance on spurious features.

Mech. Interp.
Dec 25, 2025
100%

Do Latent Tokens Think? A Causal and Adversarial Analysis of Chain-of-Continuous-Thought

Yuyi Zhang, Boyu Tang, Tianjie Ju, +2 more

Causal and adversarial analysis reveals COCONUT’s latent tokens are uninterpretable placeholders masking shortcut exploitation. This "pseudo-reasoning" resists steering but fails OOD, posing a transparency risk

Mech. Interp. Robustness Evaluations
Dec 25, 2025
100%

Zero-Shot to Zero-Lies: Detecting Bengali Deepfake Audio through Transfer Learning

Most. Sharmin Sultana Samu, Md. Rakibul Islam, Md. Zahid Hossain, +2 more

Benchmarks Bengali deepfake detection using the BanglaFake dataset, revealing zero-shot failures and the efficacy of fine-tuned ResNet18 (84.37% AUC). This improves robustness against synthetic speech attacks in low-resource languages, addressing a critical gap in AI security.

I/O Classifiers Robustness Evaluations
Dec 25, 2025
100%

Towards Responsible and Explainable AI Agents with Consensus-Driven Reasoning

Eranga Bandara, Tharaka Hewa, Ross Gore, +12 more

Implements reasoning-layer governance through a multi-model consensus architecture where a dedicated agent consolidates outputs from heterogeneous LLMs/VLMs. This enforces safety constraints and exposes cross-model disagreement, providing

Agent Safety AI Control Robustness I/O Classifiers
Dec 25, 2025
100%

Exploring the Security Threats of Retriever Backdoors in Retrieval-Augmented Code Generation

Tian Li, Bo Lin, Shangwen Wang, +1 more

VenomRACG demonstrates that poisoning 0.05% of an RACG knowledge base can force GPT-4o to generate vulnerable code in >40% of cases. By bypassing latent-

Robustness Evaluations
Dec 25, 2025
100%

The Deepfake Detective: Interpreting Neural Forensics Through Sparse Features and Manifolds

Subramanyam Sahoo, Jared Junkin

Integrates Sparse Autoencoders (SAEs) and forensic manifold analysis to map latent features to specific deepfake artifacts. Quantifying manifold geometry (curvature, dimensionality) enables mechanistic auditing of synthetic media detectors, improving

Mech. Interp. I/O Classifiers Robustness
Dec 25, 2025
100%

A Unified Definition of Hallucination, Or: It's the World Model, Stupid

Emmy Liu, Varun Gangal, Chelsea Zou, +6 more

Unifies hallucination as inaccurate world modeling relative to a reference source and conflict policy. This distinguishes epistemic failures from planning errors and enables synthetic benchmarks to stress-test model truthfulness via fully specified, controllable

Position Paper Evaluations Alignment Theory
Dec 25, 2025
100%

Bidirectional Human-AI Alignment in Education for Trustworthy Learning Environments

Hua Shen

Defines bidirectional alignment as a sociotechnical framework coupling technical value-embedding with human-side interpretability and critique. This mitigates risks to student agency and equity by framing alignment as a dynamic process

Position Paper Alignment Theory Governance AI Control