Wei Xiong

Ph.D. student

University of Illinois Urbana-Champaign, Computer Science

Email: wx13 AT illinois DOT edu

[Curriculum Vitae]      [Google Scholar]

I am currently a first-year Ph.D. student in computer science at UIUC, working with Prof. Tong Zhang and Prof. Nan Jiang. I have also been working with Google DeepMind as a full-time/20% part-time student researcher since May 2024.

Prior to this, I received a master's degree in mathematics from The Hong Kong University of Science and Technology in 2023, where my studies were supported by the Hong Kong PhD Fellowship. Before that, I spent several fabulous years at the University of Science and Technology of China, where I worked closely with Prof. Cong Shen, and obtained a B.S. in mathematics in 2021. I also worked as a research intern on the deep reinforcement learning team at Microsoft Research Asia from January to June 2021.

Research Interests

My research lies at the intersection of machine learning and decision making. The long-term goal of my research is to build strong agents that can make adaptive decisions tailored to dynamic contexts and specified criteria.

The recent breakthroughs in foundation models, especially large language models (LLMs), have further sparked my passion for developing LLM-based agents. I seek to understand the mathematical principles behind these problems, develop scalable and efficient algorithms from the underlying theoretical insights, and make these techniques available to the community. Some representative research problems and projects are listed below.

  1. Algorithm design in reinforcement learning from human feedback (RLHF)
    We propose several algorithms that complement or serve as alternatives to Proximal Policy Optimization (PPO) for the post-training of large language models, including [Iterative best-of-n fine-tuning], [Iterative direct preference learning], [Adaptive Model Averaging], and [Multi-turn direct alignment algorithms for agent learning]. In particular, our studies are the first line of work to demonstrate the power of iterative, on-policy direct preference learning compared with its offline counterparts (its objective is sketched after this list). These algorithms are crucial components of the modern alignment frameworks adopted by projects such as LLaMA and Qwen.

  2. Mathematical principles of decision-making problems

    1. We make the first attempt to formally formulate RLHF as a KL-regularized contextual bandit (for the Bradley-Terry model) and as a KL-regularized two-player minimax game (for general preference models); this formulation is written out after this list. We investigate learnability in three distinct settings (offline, online, and hybrid) and propose efficient algorithms with finite-sample theoretical guarantees.

    2. In [ZXZWWYZ22] and [X23], we show that interactive decision-making problems can be reduced to supervised online estimation, providing a unified understanding of Markov decision processes (MDPs), partially observable MDPs, and two-player zero-sum Markov games. In particular, the widely adopted optimism principle enables this reduction, and the complexity of a problem can be characterized by the cost of the reduction. The framework covers nearly all known solvable interactive decision-making problems, and it also leads to a new optimization-based approach that achieves optimism through biased loss functions, offering an alternative to traditional constrained optimistic planning.

    3. In [XZSSWZ22] and [ZXTWZWY22], we conduct initial studies of which datasets permit solving offline multi-agent RL problems. While in the single-agent setting it suffices for the dataset to cover the distribution induced by the optimal policy, our research reveals that covering the Nash equilibrium is not sufficient in the multi-agent setting, and a much stronger condition is necessary, even for the simplest two-player zero-sum Markov games.
  3. Preference signal modeling

    1. We open-source a series of Bradley-Terry reward models and pairwise preference models, which achieve state-of-the-art performance in their class, as evaluated on RewardBench.

    2. We study multi-head reward modeling based on different criteria such as helpfulness, safety, and verbosity, as well as algorithm design under this framework. We also develop a mixture-of-experts approach to aggregate the reward signals, where a gating network determines the combination coefficients from the prompt embedding (a minimal sketch of this architecture is given after this list). The resulting model significantly outperforms competing models in its class while providing a more fine-grained reward signal, and approaches the performance of the much larger Nemotron-4 340B.

    3. The resulting models have received significant attention across industry and academia, with public follow-up work from, or private communications with, organizations such as Google, Microsoft Research, Amazon, and Meta FAIR, as well as universities including KAIST, Northwestern, UIUC, and Princeton.
  4. Bringing these techniques to the community

    1. I initiated and organize the RLHFlow project, which has received ~1K GitHub stars so far. We provide step-by-step guidance for the post-training of LLMs: supervised fine-tuning, reward modeling, and online iterative direct preference learning. Our training recipe relies only on open-source datasets, and the resulting model achieves state-of-the-art performance in its class.

    2. I am also a founding member and core contributor of the LMFlow project, which has received 8.2K GitHub stars so far. We provide an extensible toolkit for fine-tuning and inference of LLMs, and I am responsible for the RLHF part of the project.
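
For reference, here is a sketch of the KL-regularized formulation mentioned in item 2.1, written in my own notation (the papers may use slightly different conventions). With prompt distribution \(d_0\), reference policy \(\pi_0\), reward function \(r\) under the Bradley-Terry model, and KL coefficient \(\eta > 0\), the target is

\[
\max_{\pi}\; \mathbb{E}_{x \sim d_0,\, y \sim \pi(\cdot \mid x)}\big[r(x, y)\big]
\;-\; \eta\, \mathbb{E}_{x \sim d_0}\big[\mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_0(\cdot \mid x)\big)\big],
\qquad
\mathbb{P}\big(y^{1} \succ y^{2} \mid x\big) = \sigma\big(r(x, y^{1}) - r(x, y^{2})\big),
\]

where \(\sigma\) is the sigmoid function; the general-preference case replaces the reward with a preference model and leads to a two-player minimax game.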
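
The direct preference learning objective referenced in item 1 takes, in its standard (DPO-style) form and again in my own notation, the shape

\[
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y^{+},\, y^{-})}
\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y^{+} \mid x)}{\pi_0(y^{+} \mid x)}
- \beta \log \frac{\pi_\theta(y^{-} \mid x)}{\pi_0(y^{-} \mid x)}\right)\right],
\]

where \(\pi_0\) is the reference policy and \(\beta\) the KL coefficient. In the iterative, on-policy variants, the preference pairs \((y^{+}, y^{-})\) are sampled from the current policy and labeled anew at each iteration, rather than drawn from a fixed offline dataset.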
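
Below is a minimal PyTorch sketch of the gated reward aggregation described in item 3.2. All class and variable names are mine, and the released models' actual architecture, backbone, and training details may differ; this is only meant to illustrate the idea of a gating network choosing the combination coefficients from the prompt embedding.

```python
# Hypothetical sketch, not the released model's actual code.
import torch
import torch.nn as nn

class GatedMultiHeadReward(nn.Module):
    def __init__(self, hidden_dim: int, num_criteria: int):
        super().__init__()
        # One scalar reward head per criterion (e.g., helpfulness, safety, verbosity).
        self.reward_heads = nn.Linear(hidden_dim, num_criteria)
        # Gating network: maps the prompt embedding to mixture coefficients.
        self.gate = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_criteria),
        )

    def forward(self, prompt_emb: torch.Tensor, response_emb: torch.Tensor) -> torch.Tensor:
        # Per-criterion rewards from the response representation (assumed to already
        # encode the prompt-response pair, e.g., a final hidden state).
        rewards = self.reward_heads(response_emb)                # (batch, num_criteria)
        # The gate looks only at the prompt to decide how to weigh the criteria.
        weights = torch.softmax(self.gate(prompt_emb), dim=-1)   # (batch, num_criteria)
        # Final scalar reward is the gated combination of the per-criterion scores.
        return (weights * rewards).sum(dim=-1)                   # (batch,)

# Toy usage with random embeddings; in practice both embeddings
# would come from an LLM backbone.
model = GatedMultiHeadReward(hidden_dim=4096, num_criteria=3)
prompt_emb, response_emb = torch.randn(2, 4096), torch.randn(2, 4096)
scores = model(prompt_emb, response_emb)  # shape: (2,)
```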

Contact

If you are interested in discussing research with me, feel free to drop me an email or add me on WeChat: wei_xiong2000.