Francesco Belardinelli
2 papers · Latest:
Cryptography & Security
Tatemae: Detecting Alignment Faking via Tool Selection in LLMs
Tatemae detects LLM alignment faking by observing how a model's tool selection changes when monitoring is lifted, revealing strategic compliance.
2604.26511
Machine Learning
SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning
SafeAdapt enables provably safe policy updates in deep RL by projecting updates onto a certified safety region, preventing catastrophic forgetting of safety.
2604.09452