Wei Xiong


Ph.D. student

University of Illinois Urbana-Champaign, Computer Science

[Curriculum Vitae]      [Google Scholar]

I am currently a first-year Ph.D. student in computer science at UIUC, working with Prof. Tong Zhang and Prof. Nan Jiang. Prior to this, I received a master's degree in mathematics in 2023 from The Hong Kong University of Science and Technology. I received a B.S. in mathematics from the University of Science and Technology of China in 2021, where I worked closely with Prof. Cong Shen.

Research Interests

My research currently focuses on reinforcement learning from human feedback (RLHF) for aligning large language models.


We formulate the real-world RLHF process as a reverse-KL regularized contextual bandit and study its theoretical properties by proposing statistically efficient algorithms with finite-sample guarantees. We also connect our theoretical findings with practical algorithms (e.g., DPO and RSO), offering new tools and insights for the design of alignment algorithms. Check out our technical report.


I am looking for a research internship position on the alignment of language models for summer 2024!

Selected Projects

(α-β) means that the order is decided by rock-paper-scissors and (*) denotes equal contribution.

  1. Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint [Slide]
    Wei Xiong*, Hanze Dong*, Chenlu Ye*, Han Zhong, Nan Jiang, Tong Zhang, Preprint.
    We formulate the real-world RLHF process as a reverse-KL regularized contextual bandit and establish the mathematical foundation of this process by proposing statistically efficient algorithms with finite-sample guarantees. With a reasonable approximation of an information-theoretic oracle, the results naturally lead to several new alignment algorithms, e.g., iterative DPO and offline DPO with multi-step rejection sampling, which achieve impressive empirical performance and outperform strong baselines such as DPO and RSO in real-world LLM alignment experiments.

  2. A Theoretical Analysis of Nash Learning from Human Feedback under General KL-Regularized Preference [Slide]
    (α-β) Chenlu Ye*, Wei Xiong*, Yuheng Zhang*, Nan Jiang, Tong Zhang, Preprint.
    We present an initial attempt to study the learnability of the KL-regularized NLHF framework, aiming to promote the development of reward-model-free preference learning under a general preference oracle.

  3. Mitigating the Alignment Tax of RLHF
    (α-β) Yong Lin*, Hangyu Lin*, Wei Xiong*, Shizhe Diao*, Jianmeng Liu, Jipeng Zhang, Rui Pan, Haoxiang Wang, Wenbin Hu, Hanning Zhang, Hanze Dong, Renjie Pi, Han Zhao, Nan Jiang, Yuan Yao, Heng Ji, and Tong Zhang, Preprint.
    We quantitatively study the alignment tax of popular RLHF algorithms, including PPO, RAFT, and DPO, and investigate various methods to alleviate forgetting, including regularization, low-rank fine-tuning, data replay, reward regularization, and model averaging. We also propose adaptive model averaging, which is the most competitive across the different tasks of the benchmark.

  4. RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
    (α-β) Hanze Dong*, Wei Xiong*, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum and Tong Zhang, TMLR.
    An alternative alignment algorithm for generative models (LLMs and diffusion models) based on reward ranking and supervised fine-tuning. At a high level, RAFT iteratively learns (in an offline manner) from the best-of-n policy. Since the training is offline and based on supervised learning, it is fast, robust, and memory-efficient (only one model is loaded at a time). Meanwhile, thanks to the clear learning objective (the best-of-n policy), one can monitor the filtered dataset to mitigate reward hacking.

  5. LMFlow: An Extensible Toolkit for Finetuning and Inference of Large Foundation Models
    Shizhe Diao, Rui Pan, Hanze Dong, Ka Shun Shum, Jipeng Zhang, Wei Xiong, Tong Zhang, Preprint.
    LMFlow is a framework for developing LLMs (fine-tuning, inference, RLHF…) with minimal cost and effort; it has 7K+ stars on GitHub. I was responsible for developing the RLHF part of the project.

  6. Master thesis: A Sufficient Condition of Sample-Efficient Reinforcement Learning with General Function Approximation
    An introduction to the eluder coefficient, together with a new confidence-set-free algorithmic framework based on optimism and an optimization oracle, which is the counterpart of the posterior sampling framework in the original GEC paper and may be easier to approximate in practice.

  7. GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP, and Beyond [Slide]
    (α-β) Han Zhong*, Wei Xiong*, Sirui Zheng, Liwei Wang, Zhaoran Wang, Zhuoran Yang and Tong Zhang, Preprint.
    A unified viewpoint of solvable decision-making problems: limited generalization in an online manner, and a reduction from interactive decision making to offline supervised learning. Also a unified algorithmic framework for both posterior sampling and UCB algorithms. See my thesis for a more thorough illustration of the ideas. In contrast to the recent remarkable breakthroughs of the decoupling technique (the decoupling coefficient and the decision-estimation coefficient), the framework is less general (the eluder technique cannot even handle ReLU...), but it may be easier to approximate in practice, admits more intuitive algorithmic designs, and achieves superior bounds in the model-free case. Interestingly, the newly proposed measures can be interpreted as a reduction to online supervised learning, in contrast to the offline reduction of the eluder coefficient.

  8. Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear Contextual Bandits and Markov Decision Processes
    Chenlu Ye, Wei Xiong, Quanquan Gu and Tong Zhang, ICML 2023.
    An application of weighted regression in the face of adversarial corruptions: a new weight design, and new techniques for controlling the sum of weights (the counterpart of the elliptical potential lemmas).

  9. Nearly Minimax Optimal Offline Reinforcement Learning with Linear Function Approximation: MDP and MG [Slide]
    Wei Xiong*, Han Zhong*, Chengshuai Shi, Cong Shen, Liwei Wang, and Tong Zhang, ICLR 2023.
    An application of weighted regression that uses variance information to achieve sharper bounds.

  10. PMGT-VR: A decentralized proximal-gradient algorithmic framework with variance reduction [Slide]
    Haishan Ye*, Wei Xiong*, and Tong Zhang, Under Minor Revision at TPAMI.
    A unified algorithmic and analysis framework for decentralized variance-reduction algorithms, with convergence rates matching their centralized counterparts.

  11. Heterogeneous Multi-player Multi-armed Bandits: Closing the Gap and Generalization [Code]
    Chengshuai Shi, Wei Xiong, Cong Shen, and Jing Yang, NeurIPS 2021.
    We develop a new algorithm, BEACON, with batched exploration and a carefully crafted communication strategy. BEACON achieves the minimax-optimal regret bound as well as impressive empirical performance.
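The core loop of RAFT (project 4 above) can be summarized in a few lines. Below is a minimal sketch of the reward-ranked filtering step, with toy stand-ins (`generate`, `reward`, and the function name itself are illustrative, not from the actual codebase) in place of a real LLM sampler and reward model:

```python
import random

def best_of_n_filter(prompts, generate, reward, n=8):
    """For each prompt, sample n candidate responses and keep only the
    highest-reward one -- i.e., the dataset induced by the best-of-n policy."""
    filtered = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n)]
        best = max(candidates, key=lambda resp: reward(prompt, resp))
        filtered.append((prompt, best))
    return filtered

# Toy stand-ins: responses are random words, reward prefers "great".
random.seed(0)
toy_generate = lambda prompt: random.choice(["ok", "good", "great"])
toy_reward = lambda prompt, resp: {"ok": 0, "good": 1, "great": 2}[resp]

data = best_of_n_filter(["q1", "q2"], toy_generate, toy_reward, n=8)
# In RAFT, `data` would now be used for one round of supervised fine-tuning,
# after which the loop repeats with the updated policy.
```

Because each round reduces to supervised learning on the filtered dataset, the filtered pairs can also be inspected directly to catch reward hacking before fine-tuning.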
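The uncertainty-weighting idea behind projects 8 and 9 can be illustrated in one dimension. The actual algorithms use confidence-set-based weights for nonlinear function classes; the sketch below (function name, weight rule, and data all illustrative) only shows the basic mechanism: samples with a large uncertainty estimate are downweighted so they contribute little to the regression.

```python
def uncertainty_weighted_ls(xs, ys, uncertainties):
    """1-D least squares through the origin with uncertainty weighting:
    each sample gets weight w_i = 1 / max(1, u_i), so highly uncertain
    (potentially corrupted) samples barely influence the fit."""
    ws = [1.0 / max(1.0, u) for u in uncertainties]
    num = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    den = sum(w * x * x for w, x in zip(ws, xs))
    return num / den

# Three clean samples from y = 2x plus one corrupted sample that carries
# a large uncertainty estimate; the weighted fit stays near the true slope 2.
slope = uncertainty_weighted_ls([1, 2, 3, 2], [2, 4, 6, 50], [1, 1, 1, 1000])
```

An unweighted fit on the same data is pulled far from the true slope by the single corrupted point, which is exactly the failure mode the weight design guards against.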


I often learn by copying books…

1 Note on exponential inequality

2 Note on metric entropy

3 An alternative analysis of high-probability generalization bound for uniformly stable algorithms

4 Note on minimax lower bound

5 Note on reduction-based RL

6 Note on non-linear contextual bandit

7 Note on martingale concentration inequality

If you are interested in a discussion, feel free to drop me an email or add me on WeChat: wei_xiong2000.


Email: wx13 AT illinois DOT edu