Wei Xiong

Ph.D. candidate

University of Illinois Urbana-Champaign, Computer Science

Email: wx13 AT illinois DOT edu

I am currently a second-year Ph.D. candidate in computer science at UIUC, working with Prof. Tong Zhang and Prof. Nan Jiang.

Prior to this, I received a master's degree in mathematics in 2023 from The Hong Kong University of Science and Technology, where my study was supported by the Hong Kong PhD Fellowship. I enjoyed several fabulous years at the University of Science and Technology of China and obtained a B.S. in mathematics in 2021, where I worked closely with Prof. Cong Shen. I worked as a research intern at the deep reinforcement learning team in Microsoft Research Asia from 2021.1 to 2021.6. I also worked with Google Deepmind as a student researcher from 2024.5 to 2025.4. This summer, I will be a research intern at Meta FAIR, working with the Alignment team. Feel free to reach out if you'd like to chat about exciting topics in LLM post-training!

Research Interests

News

I organize and co-found open-source projects to make the LLM post-training techniques more accessible to the community. See the projects RLHFlow and LMFlow. Also check the publication page for a full list of research papers, codes, and slides.

My research lies at the intersection of machine learning and decision making problems. The long-term goal of my research is to build strong agents that can make adaptive decisions tailored to dynamic contexts and specified criteria.

The recent breakthroughs in foundation large language models (LLMs) have further sparked my passion to develop LLM-based agent. I seek to understand the mathematical principle behind the problems, develop scalable and efficient algorithms from the underlying theoretical insights, and make these techniques available to the community. I list some representative research problems and projects here.

Algorithm designs in reinforcement learning from human feedback (RLHF)
We propose several complementary or alternative algorithms of Proximal Policy Optimization (PPO) for the post-training of large language models, including [Iterative best-of-n fine-tuning], [Iterative direct preference learning], [Adaptive Model Averaging], and [Multi-turn direct alignment algorithms for agent learning]. In particular, our studies are the first line of works to illustrate the power of iterative and on-policy direct preference learning, as compared to the offline counterparts. These algorithms are crucial components of the modern alignment frameworks adopted by projects such as LLaMA or Qwen.
Mathematical principle of decision making porblems
1. We make the first attempt to formally formulate the RLHF as a KL-regularized contextual bandit (for Bradley-Terry model) and a KL-regularized two-player minimax game (for general preference model). We investigate its learnability in three distinct settings—offline, online, and hybrid—and propose efficient algorithms with finite-sample theoretical guarantees.
2. In [ZXZWWYZ22] and [X23], we show that the interactive decision-making problems can be reduced to a supervised online estimation, providing a unified understanding of Markov decision process (MDP), partially observable MDP, and two-player zero-sum Markov game. In particular, the widely adopted optimism principle serves to empower such a reduction and the complexity of the problem can be characterized by the cost of such a reduction. The framework covers nearly all known solvable interactive decision-making problems. This framework also leads to the development of a new optimization-based approach that achieves optimism through biased loss functions, offering an alternative to traditional constrained optimistic planning.
3. In [XZSSWZ22] and [ZXTWZWY22], we conduct initial studies on what dataset permits solving offline multi-agent RL problems. While the single-agent setting benefits from covering the single distribution induced by the optimal policy, our research reveals that covering the Nash equilibrium is not sufficient and a much stronger condition is necessary for multi-agent scenarios, even for the simplest two-player zero-sum Markov game.
Preference signal modeling
1. We open-source a series of Bradley-Terry reward model and Generative pair-wise preference model, which achieve state-of-the-art performance of their class, as evaluated by the Reward bench.
2. We study the multi-head reward modeling from different criterion like helpfulness, safety, and verbosity, as well as the algorithm design under this framework. We also develop a mixture-of-expert approach to aggregate the reward signals where a gating network decides the combination coefficient by the embedding of the prompt. Our resulting model, significantly outperforms competing models in its class with a more fine-grained reward signal, and approaches the performance of the much larger Nemotron-4 340B.
3. The resulting models have received significant attention across industry and academic circles, with either public follow-up works or private communications with major entities such as Google, Microsoft Research, Amazon, Meta FAIR, as well as universities like KAIST, Northwestern, UIUC, and Princeton.
Bringing the techniques to the community
1. I initialize and organize the RLHFlow project, which received >1K GitHub stars so far. We present a step by step guidance to do the post training of LLMs: supervised fine-tuning, reward modeling, and online iterative direct preference learning. Our training recipe is based only on the open-source dataset and the resulting model achieves state-of-the-art performance in its class.
2. I am also a founding member and core contributor of the LMFlow project, which received 8.2K GitHub stars so far. We present an extensible toolkit for finetuning and inference of LLMs. I am responsible for the RLHF part of the project.

Service

Conference Reviewer: ICLR, NeurIPS, ICML, AISTATS, EMNLP

Journal Reviewer: Journal of Machine Learning Research (JMLR), Springer Machine Learning Journal (ML), Transaction on Machine Learning Research (TMLR), Journal of the American Statistical Association (JASA).

Recent Talks

Iterative Preference Learning for Large Language Model Post Training
1. Sep 2024 Talk at the Simons Institute Workshop: Emerging Generalization Settings,
2. Sep 2024 Talk at UIUC Machine Learning Seminar
3. Aug 2024 Talk at University of Waterloo
4. July 2024 Talk at Mila Alignment Seminar
Building Math Agent by Iterative Preference Learning
1. Jan 2025, Talk at UCLA Data Mining Group
2. Nov 2024, Talk at Amazon Rufus Group
3. Oct 2024, Talk at Informs Annual Meeting
4. Oct 2024, Talk at UIUC-NLP large group meeting
5. Aug 2024 Talk at Google Deepmind Sky Team, NYC
Reinforcement Learning from Human Feedback: From Theory to Algorithm
1. Dec 2024, Guest Lecture at University of Wisconsin-Madison for CS760 Machine Learning
2. Nov 2024, Guest Lecture at UVA for CS 4501 Natural Language Processing
3. June 2024 Talk at Google Multi-turn RLHF Workshop, MTV
4. May 2024 Talk at Google Learning Theory Seminar, NYC
5. Jan 2024 Talk at Microsoft Research Asia

Contact

If you are interested in discussing with me, feel free to drop me an email or add my wechat wei_xiong2000