Wei Xiong

Ph.D. candidate
University of Illinois Urbana-Champaign, Computer Science
Email: wx13 AT illinois DOT edu

[Curriculum Vitae]      [Google Scholar]

I am currently a third-year Ph.D. candidate in computer science at UIUC, working with Prof. Tong Zhang and Prof. Nan Jiang.

Prior to this, I received a master's degree in mathematics from The Hong Kong University of Science and Technology in 2023, where my studies were supported by the Hong Kong PhD Fellowship. Before that, I spent several fabulous years at the University of Science and Technology of China, where I worked closely with Prof. Cong Shen and obtained a B.S. in mathematics in 2021.

Research Highlights

My research focuses on reinforcement learning (RL) and its applications in LLM post-training. I study both the design of core RL algorithms and the development of practical training recipes. I am also interested in understanding the training dynamics and mathematical foundations underlying these methods, with the ultimate goal of improving both training stability and final model performance at scale.

I take pride in deriving rigorous mathematical insights while also devoting substantial effort to hands-on engineering, bridging the gap between theory and practice in pursuit of elegant and scalable algorithms. High-level mathematical insights help ensure that our practical simplifications move in the right direction, while engineering-intensive implementation has become an essential ingredient in the success of modern LLM projects.

Some of my past work includes:

  1. Online rejection-sampling fine-tuning [3][13]; [Code] (a minimal sketch appears after this list)

    1. a widely used and competitive algorithm for Llama and Qwen post-training;

    2. a practical recipe that competes with GRPO and provides a theoretical understanding of GRPO through its connection to rejection sampling;

    3. introduces an elegant inference-budget allocation strategy motivated by the variance-reduction principle in gradient estimation.

  2. Reinforce-Ada [1]; [Code] (an illustrative sketch appears after this list)

    1. my most recent work proposes a simple yet scalable adaptive sampling framework that continuously reallocates sampling effort toward prompts with the greatest uncertainty or learning potential, addressing the signal-loss problem in GRPO training.

  3. Online DPO [10] and regret analysis of KL-regularized RL [4][10][14]; [Code1:Reward Modeling] [Code2:Online DPO]

    1. provides the first convergence and refined logarithmic regret analyses of KL-regularized RL(HF);

    2. translates the theoretical insights into the design of online DPO;

    3. releases the first open-source recipe for online DPO.

  4. I co-founded and lead the open-source RLHFlow project (2,000 GitHub stars, 500 academic citations, 1M Hugging Face downloads) [7][8][9].

    1. provides a complete post-training pipeline including SFT, reward modeling, and online DPO, with final models outperforming Meta's Llama-3-8B-it;

    2. trained many widely used open-source Bradley-Terry (BT) reward models and preference models for RLHF;

    3. introduced multi-head reward models with MoE-style aggregation (ARMO [8]), which have contributed to the open-source community and gathered 200+ citations;

    4. released the first open-source recipe for (generative) process reward models.
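
To make item 1 above concrete, the following is a minimal Python sketch of one round of rejection-sampling fine-tuning (best-of-n selection). The helpers `generate` and `reward_fn` are hypothetical placeholders for the policy's sampler and a reward model; the actual recipes in [3][13] add further ingredients such as the inference-budget allocation mentioned above.

```python
# Minimal sketch of one round of rejection-sampling fine-tuning.
# `generate` and `reward_fn` are hypothetical placeholders for the policy's
# sampler and a reward model; real recipes add batching and budget allocation.
def rejection_sampling_round(generate, reward_fn, prompts, n_samples=8):
    sft_data = []
    for prompt in prompts:
        # Draw n candidate responses from the current policy.
        candidates = [generate(prompt) for _ in range(n_samples)]
        scores = [reward_fn(prompt, c) for c in candidates]
        # Keep only the highest-reward response for the supervised update.
        best = candidates[scores.index(max(scores))]
        sft_data.append((prompt, best))
    return sft_data  # (prompt, response) pairs for a standard SFT step
```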
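
The second sketch illustrates the signal-loss issue that adaptive sampling targets (item 2): when every sampled response to a prompt receives the same reward, the GRPO group advantage is zero and the prompt contributes no gradient. One plausible heuristic, not necessarily the exact allocation rule used in [1], is to route extra generation budget to such unresolved prompts.

```python
# Illustrative heuristic for adaptive sampling (an assumption for exposition,
# not the exact allocation rule of Reinforce-Ada): give extra generations to
# prompts whose rewards so far show no spread, since a zero-variance reward
# group yields zero GRPO advantage and hence no learning signal.
def allocate_extra_samples(rewards_by_prompt, extra_budget):
    unresolved = [p for p, rs in rewards_by_prompt.items() if len(set(rs)) <= 1]
    allocation = {p: 0 for p in rewards_by_prompt}
    for i in range(extra_budget):
        if not unresolved:
            break
        allocation[unresolved[i % len(unresolved)]] += 1
    return allocation
```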

Some older work in learning theory

  • Learnability and regret analysis in multi-agent Markov games [i] [ii] [iii]

  • A unified viewpoint of learnability and generalization in online decision making [iv] [v]

  • Improved algorithm design and analysis in decentralized convex optimization [vi]

Industry Experience

  • Research Intern, Meta FAIR, May 2025 to August 2025

    • Taught LLMs to segment reasoning trajectories into coherent intermediate steps, improving interpretability and stability of reasoning.

    • Trained a generative process reward model via RL to evaluate and guide step-by-step reasoning.

  • Student Researcher, Google DeepMind (Gemini Post-Training Team), May 2024 to April 2025

    • Formulated a multi-turn RL framework for agent tasks and developed a multi-turn variant of DPO for scalable online alignment.

    • Designed robust reward modeling techniques to mitigate reward hacking and improve reliability of RLHF training.

    • Contributed to internal projects related to thinking LLMs.

  • Research Intern, Microsoft Research Asia, January 2021 to June 2021

    • Developed RL-based bandwidth estimation algorithms for real-time communications in Microsoft Teams.

    • Conducted research on distributional reinforcement learning.

Honors and Awards

  1. Google PhD Fellowship ($85,000 per year), 2025

  2. Best Paper Award, Demo Track, NAACL, 2024

  3. NeurIPS Travel Award and Best Reviewer, 2023

  4. Hong Kong PhD Fellowship Scheme (HKPFS) (approx. $90,000 over two years), 2021-2023

  5. Guo Moruo Scholarship, Finalist (Highest honor for undergraduates at USTC), 2020

Service

Conference Service

  • Area Chair: NeurIPS 2025 Workshop on Multi-Turn Interactions in Large Language Models.

  • Organizer: NeurIPS 2025 Workshop on MATH-AI: The 5th Workshop on Mathematical Reasoning and AI.

  • Reviewer: ICLR (2024-2025), NeurIPS (2022-2024, Top Reviewer Award (Top 8%) 2023), ICML (2022-2023, 2025), AISTATS (2023-2025), ARR (2024-2025)

Journal Reviewer

  • Journal of Machine Learning Research (JMLR), Transactions on Machine Learning Research (TMLR), Journal of the American Statistical Association (JASA)

Recent Talks

  1. Iterative Preference Learning for Large Language Model Post Training

    1. Sep 2024, Talk at the Simons Institute Workshop: Emerging Generalization Settings

    2. Sep 2024, Talk at UIUC Machine Learning Seminar

    3. Aug 2024, Talk at University of Waterloo

    4. July 2024, Talk at Mila Alignment Seminar

  2. Building Math Agents by Iterative Preference Learning

    1. Jan 2025, Talk at UCLA Data Mining Group

    2. Nov 2024, Talk at Amazon Rufus Group

    3. Oct 2024, Talk at INFORMS Annual Meeting

    4. Oct 2024, Talk at UIUC-NLP large group meeting

    5. Aug 2024, Talk at Google DeepMind Sky Team, NYC

  3. Reinforcement Learning from Human Feedback: From Theory to Algorithm

    1. Dec 2024, Guest Lecture at University of Wisconsin-Madison for CS760 Machine Learning

    2. Nov 2024, Guest Lecture at UVA for CS 4501 Natural Language Processing

    3. June 2024, Talk at Google Multi-turn RLHF Workshop, MTV

    4. May 2024, Talk at Google Learning Theory Seminar, NYC

    5. Jan 2024, Talk at Microsoft Research Asia

Contact

If you are interested in chatting with me, feel free to drop me an email or add me on WeChat: wei_xiong2000.