ArXiv TLDR

SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning

arXiv:2604.09452

Maksim Anisimov, Francesco Belardinelli, Matthew Wicker

cs.LG, cs.AI

TLDR

SafeAdapt enables provably safe policy updates in deep RL by projecting updates onto a certified safety region, preventing catastrophic forgetting of safety.

Key contributions

  • Introduces SafeAdapt, an a priori approach for provably safe policy updates in continual RL.
  • Defines the Rashomon set, a certified region in policy parameter space within which safety constraints provably hold on the demonstration data distribution.
  • Guarantees safety by projecting arbitrary RL algorithm updates onto the Rashomon set.
  • Prevents catastrophic forgetting of safety constraints during adaptation, unlike regularization baselines.
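The core mechanism in the contributions above, projecting an arbitrary optimizer step back onto a certified safe region, can be sketched in a toy form. The snippet below is a simplified, hypothetical illustration only: it stands in for the paper's Rashomon set with an axis-aligned box of certified parameter intervals, whereas the actual certified region in the paper is derived from the demonstration data.

```python
def project_update(theta, update, lower, upper):
    """Apply an arbitrary RL optimizer step, then project the resulting
    parameters back into a certified box. The box [lower, upper] is a
    simplified stand-in for the paper's Rashomon set: any parameters
    inside it are assumed to satisfy the safety constraints."""
    return [min(max(t + u, lo), hi)
            for t, u, lo, hi in zip(theta, update, lower, upper)]

# Toy example: a certified-safe interval of +/-0.1 around the initial params.
theta = [0.0, 0.0, 0.0, 0.0]
lower = [-0.1] * 4
upper = [0.1] * 4
update = [0.05, -0.3, 0.2, 0.0]   # arbitrary (possibly unsafe) step
theta_new = project_update(theta, update, lower, upper)
print(theta_new)  # -> [0.05, -0.1, 0.1, 0.0]
```

Because the projection is applied after every update, safety on the source task holds a priori, regardless of which RL algorithm produced the step.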

Why it matters

This paper addresses a critical challenge in deploying RL: maintaining safety during policy updates in dynamic environments. By introducing the Rashomon set and a projection mechanism, it offers a general method for provably safe adaptation. This enables more reliable and trustworthy RL systems for real-world safety-critical applications.

Original Abstract

Safety guarantees are a prerequisite to the deployment of reinforcement learning (RL) agents in safety-critical tasks. Often, deployment environments exhibit non-stationary dynamics or are subject to changing performance goals, requiring updates to the learned policy. This leads to a fundamental challenge: how to update an RL policy while preserving its safety properties on previously encountered tasks? The majority of current approaches either do not provide formal guarantees or verify policy safety only a posteriori. We propose a novel a priori approach to safe policy updates in continual RL by introducing the Rashomon set: a region in policy parameter space certified to meet safety constraints within the demonstration data distribution. We then show that one can provide formal, provable guarantees for arbitrary RL algorithms used to update a policy by projecting their updates onto the Rashomon set. Empirically, we validate this approach across grid-world navigation environments (Frozen Lake and Poisoned Apple) where we guarantee an a priori provably deterministic safety on the source task during downstream adaptation. In contrast, we observe that regularisation-based baselines experience catastrophic forgetting of safety constraints while our approach enables strong adaptation with provable guarantees that safety is preserved.
