Jan Dubiński
2 papers · Latest:
Natural Language Processing
Negation Neglect: When models fail to learn negations in training
LLMs finetuned on documents that flag claims as false often learn to believe those claims are true, a phenomenon called Negation Neglect.
2605.13829
Machine Learning
Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers
Common interventions to reduce emergent misalignment in LLMs can hide it, causing conditional misalignment triggered by training-like contexts.
2604.25891