Wei Xiong

Ph.D. student

University of Illinois Urbana-Champaign, Computer Science

[Curriculum Vitae]      [Google Scholar]

I am currently a first-year Ph.D. student in computer science at UIUC, working with Prof. Tong Zhang and Prof. Nan Jiang. I have also been a student researcher at Google DeepMind since May 2024. Prior to this, I received a master's degree in mathematics from The Hong Kong University of Science and Technology in 2023. Before that, I spent several fabulous years at the University of Science and Technology of China, where I worked closely with Prof. Cong Shen and obtained a B.S. in mathematics in 2021.

Research Interests

My recent research focuses on reinforcement learning from human feedback (RLHF) for aligning large language models. Previously, I spent time on the mathematical foundations of RL, where I was fortunate to collaborate with many great senior mentors and talented peers. I also spent time working on deep RL at Microsoft Research Asia.

My primary research goal is to develop practical, theoretically sound algorithms and to make these techniques easily accessible to the community.

News

We formulate the real-world RLHF process as a reverse-KL-regularized contextual bandit and study its theoretical properties, proposing statistically efficient algorithms with finite-sample guarantees. We also connect our theoretical findings with practical algorithms (e.g., DPO and RSO), offering new tools and insights for the design of alignment algorithms. Check out our technical report.
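
Roughly speaking, the formulation studies the following KL-regularized objective (a sketch in my own notation; the exact setup in the report may differ): given a prompt distribution $d_0$, a reward $r$, and an initial policy $\pi_0$, the learner solves

\[
\max_{\pi} \; \mathbb{E}_{x \sim d_0,\, a \sim \pi(\cdot \mid x)}\big[r(x,a)\big] \;-\; \eta\, \mathbb{E}_{x \sim d_0}\big[\mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_0(\cdot \mid x)\big)\big],
\]

where $x$ is a prompt, $a$ is a response, and $\eta > 0$ controls how far the learned policy may drift from $\pi_0$.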

Selected Projects

(α-β) means that the order is decided by rock-paper-scissors and (*) denotes equal contribution.

  1. RLHF Workflow: From Reward Modeling to Online RLHF [Code]
    (α-β) Hanze Dong*, Wei Xiong*, Bo Pang*, Haoxiang Wang*, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang, Technical Report
    An open-source recipe for online RLHF that produces state-of-the-art reward models and chatbots.

  2. Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts [Model]
    Haoxiang Wang*, Wei Xiong*, Tengyang Xie, Han Zhao, Tong Zhang, Technical Report
    A multi-head reward model with a mixture-of-experts-style aggregation strategy. The best open-source reward model in its class!

  3. Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint [Slide] [Code]
    Wei Xiong*, Hanze Dong*, Chenlu Ye*, Han Zhong, Nan Jiang, Tong Zhang, ICML 2024; also an oral presentation at the ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models.
    Online iterative DPO; Mathematical foundation of RLHF as a KL-regularized contextual bandit.

  4. Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards [Slide] [Code]
    Haoxiang Wang*, Yong Lin*, Wei Xiong*, Rui Yang, Shizhe Diao, Shuang Qiu, Han Zhao, Tong Zhang, ACL 2024
    A user-preference-aware alignment framework under a multi-objective reward formulation.

  5. Online Iterative Reinforcement Learning from Human Feedback with General Preference Model [Slide]
    (α-β) Chenlu Ye*, Wei Xiong*, Yuheng Zhang*, Nan Jiang, Tong Zhang, Preprint.
    Learnability of RLHF under a general, reward-model-free preference structure.

  6. Mitigating the Alignment Tax of RLHF
    (α-β) Yong Lin*, Hangyu Lin*, Wei Xiong*, Shizhe Diao*, Jianmeng Liu, Jipeng Zhang, Rui Pan, Haoxiang Wang, Wenbin Hu, Hanning Zhang, Hanze Dong, Renjie Pi, Han Zhao, Nan Jiang, Yuan Yao, Heng Ji, and Tong Zhang, Preprint.
    Model merging for mitigating the alignment tax.

  7. RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment [Code]
    (α-β) Hanze Dong*, Wei Xiong*, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum and Tong Zhang, TMLR.
    We propose RAFT to iteratively learn from best-of-n sampling; the approach is also known as rejection sampling fine-tuning (a toy sketch of the loop appears after this list).

  8. LMFlow: An Extensible Toolkit for Finetuning and Inference of Large Foundation Models [Code]
    Shizhe Diao, Rui Pan, Hanze Dong, Ka Shun Shum, Jipeng Zhang, Wei Xiong, Tong Zhang, NAACL 2024
    A framework for developing LLMs (fine-tuning, inference, RLHF, …) with 8K+ stars on GitHub. I was responsible for developing the RLHF part of the project.

  9. Master thesis: A Sufficient Condition of Sample-Efficient Reinforcement Learning with General Function Approximation
    An introduction to the eluder coefficient, together with a new confidence-set-free algorithmic framework based on optimism (the long version was accepted to NeurIPS).

  10. GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP, and Beyond [Slide]
    (α-β) Han Zhong*, Wei Xiong*, Sirui Zheng, Liwei Wang, Zhaoran Wang, Zhuoran Yang and Tong Zhang, Preprint.
    A unified viewpoint on solvable decision-making problems: limited generalization in an online manner, and a reduction from interactive decision making to offline supervised learning. Also a unified algorithmic framework covering both posterior sampling and UCB-style algorithms.

  11. Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear Contextual Bandits and Markov Decision Processes
    Chenlu Ye, Wei Xiong, Quanquan Gu and Tong Zhang, ICML 2023.
    Weighted regression in the face of adversarial corruption: a new weight design, and new techniques for controlling the sum of weights (a counterpart of the elliptical potential lemma).

  12. Nearly Minimax Optimal Offline Reinforcement Learning with Linear Function Approximation: MDP and MG [Slide]
    Wei Xiong*, Han Zhong*, Chengshuai Shi, Cong Shen, Liwei Wang, and Tong Zhang, ICLR 2023.
    An application of weighted regression that uses variance information to achieve sharper bounds.

  13. PMGT-VR: A decentralized proximal-gradient algorithmic framework with variance reduction [Slide]
    Haishan Ye*, Wei Xiong*, and Tong Zhang, Under Minor Revision at TPAMI.
    A unified algorithmic and analysis framework for decentralized variance-reduction algorithms, with convergence rates matching those of their centralized counterparts.

  14. Heterogeneous Multi-player Multi-armed Bandits: Closing the Gap and Generalization [Code]
    Chengshuai Shi, Wei Xiong, Cong Shen, and Jing Yang, NeurIPS, 2021.
    A new algorithm, BEACON, with batched exploration and a carefully crafted communication strategy: a minimax-optimal regret bound and impressive empirical performance.
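
As a minimal illustration of the best-of-n idea behind RAFT (project 7 above), here is a hedged Python sketch; the generator and reward model below are toy placeholders rather than the actual implementation, and the hyperparameters are made up for illustration.

import random

def generate(prompt, n):
    # Toy stand-in for sampling n responses from the current policy (an LLM).
    return [f"{prompt} :: candidate {i} ({random.random():.2f})" for i in range(n)]

def reward(prompt, response):
    # Toy stand-in for a learned reward model that scores (prompt, response).
    return random.random()

def raft_iteration(prompts, n=8):
    # One RAFT step: sample n candidates per prompt, keep the highest-reward
    # candidate, and return the filtered pairs as supervised fine-tuning data.
    sft_data = []
    for prompt in prompts:
        candidates = generate(prompt, n)
        best = max(candidates, key=lambda resp: reward(prompt, resp))
        sft_data.append((prompt, best))
    return sft_data  # fine-tune the policy on these pairs, then repeat

if __name__ == "__main__":
    print(raft_iteration(["Explain best-of-n sampling in one sentence."], n=4))

In the real pipeline, the fine-tuning step updates the policy between iterations, so later rounds sample from an improved model.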

Misc

I take notes while learning …

1 Note on exponential inequality

2 Note on metric entropy

3 An alternative analysis of high-probability generalization bound for uniformly stable algorithms

4 Note on minimax lower bound

5 Note on reduction-based RL

6 Note on non-linear contextual bandit

7 Note on martingale concentration inequality

If you are interested in discussing with me, feel free to drop me an email or add me on WeChat: wei_xiong2000.

Contact

Email: wx13 AT illinois DOT edu