Anamika Lochab
2 papers · Latest:
Machine Learning
Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity
UCPO improves diversity in RLVR by penalizing non-uniform distributions over correct solutions, boosting Pass@K while maintaining Pass@1.
2605.00365
Machine Learning
Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control
Entrocraft, a new rejection-sampling method, precisely controls entropy in LLM RL, preventing performance saturation and significantly boosting training gains.
2604.26326