Research

Papers and Preprints

  1. RLHF Workflow: From Reward Modeling to Online RLHF [Code]
    (α-β) Hanze Dong*, Wei Xiong*, Bo Pang*, Haoxiang Wang*, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang, Technical Report.
    An open-source recipe for online RLHF that produces state-of-the-art reward models and chatbots.

  2. Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts [Model]
    Haoxiang Wang*, Wei Xiong*, Tengyang Xie, Han Zhao, Tong Zhang, Technical Report.
    A multi-head reward model with a mixture-of-experts-style aggregation strategy; the best open-source reward model in its class! A sketch of the architecture appears at the end of this page.

  3. Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint [Slide] [Code]
    Wei Xiong*, Hanze Dong*, Chenlu Ye*, Han Zhong, Nan Jiang, Tong Zhang, ICML 2024; also an oral presentation at the ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models.
    Online iterative DPO, with a mathematical foundation of RLHF as a KL-regularized contextual bandit; the objective is written out at the end of this page.

  4. Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards [Slide] [Code]
    Haoxiang Wang*, Yong Lin*, Wei Xiong*, Rui Yang, Shizhe Diao, Shuang Qiu, Han Zhao, Tong Zhang, ACL 2024.
    A user-preference-aware alignment framework under a multi-objective reward formulation.

  5. Online Iterative Reinforcement Learning from Human Feedback with General Preference Model [Slide]
    (α-β) Chenlu Ye*, Wei Xiong*, Yuheng Zhang*, Nan Jiang, Tong Zhang, Preprint.
    Learnability of RLHF under a general, reward-model-free preference structure.

  6. Mitigating the Alignment Tax of RLHF
    (α-β) Yong Lin*, Hangyu Lin*, Wei Xiong*, Shizhe Diao*, Jianmeng Liu, Jipeng Zhang, Rui Pan, Haoxiang Wang, Wenbin Hu, Hanning Zhang, Hanze Dong, Renjie Pi, Han Zhao, Nan Jiang, Yuan Yao, Heng Ji, and Tong Zhang, Preprint.
    Model merging to mitigate the alignment tax of RLHF.

  7. RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
    Hanze Dong*, Wei Xiong*, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum and Tong Zhang, TMLR.
    We propose RAFT, which iteratively fine-tunes on the best of n sampled responses as ranked by a reward model. The method is also known as rejection sampling fine-tuning; a sketch of one round appears at the end of this page.

  8. A Sufficient Condition of Sample-Efficient Reinforcement Learning with General Function Approximation
    Wei Xiong, Master's Thesis.

  9. One Objective to Rule Them All: A Maximization Objective Fusing Estimation and Planning for Exploration
    Zhihan Liu*, Miao Lu*, Wei Xiong*, Han Zhong, Hao Hu, Shenao Zhang, Sirui Zheng, Zhuoran Yang, Zhaoran Wang, NeurIPS 2023.

  10. GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP, and Beyond [Slide]
    Han Zhong*, Wei Xiong*, Sirui Zheng, Liwei Wang, Zhaoran Wang, Zhuoran Yang and Tong Zhang, Preprint.

  11. Reward Teaching for Federated Multi-Armed Bandits
    Chengshuai Shi, Wei Xiong, Cong Shen, and Jing Yang, IEEE Transactions on Signal Processing.
    A short version was accepted at the IEEE International Symposium on Information Theory (ISIT 2023).

  12. Provably Efficient Offline Reinforcement Learning with Perturbed Data Sources
    Chengshuai Shi, Wei Xiong, Cong Shen, and Jing Yang, ICML 2023.

  13. Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear Contextual Bandits and Markov Decision Processes
    Chenlu Ye, Wei Xiong, Quanquan Gu and Tong Zhang, ICML 2023.

  14. Nearly Minimax Optimal Offline Reinforcement Learning with Linear Function Approximation: Single-Agent MDP and Markov Game [Slide]
    Wei Xiong*, Han Zhong*, Chengshuai Shi, Cong Shen, Liwei Wang, and Tong Zhang, ICLR 2023.

  15. A Self-Play Posterior Sampling Algorithm for Zero-Sum Markov Game
    Wei Xiong, Han Zhong, Chengshuai Shi, Cong Shen, and Tong Zhang, ICML 2022.

  16. Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets
    Han Zhong*, Wei Xiong*, Jiyuan Tan*, Liwei Wang, Tong Zhang, Zhaoran Wang, and Zhuoran Yang, ICML 2022.

  17. An Alternative Analysis of High-Probability Generalization Bound for Uniformly Stable Algorithms
    Wei Xiong, Yong Lin, and Tong Zhang, Project Report; not intended for publication.

  18. PMGT-VR: A decentralized proximal-gradient algorithmic framework with variance reduction [Slide]
    Haishan Ye*, Wei Xiong*, and Tong Zhang, under minor revision at TPAMI.

  19. Algorithms of Multi-player Multi-armed Bandits under Different Settings
    Wei Xiong, Undergraduate Thesis (in Chinese).

  20. Heterogeneous Multi-player Multi-armed Bandits: Closing the Gap and Generalization [Code]
    Chengshuai Shi, Wei Xiong, Cong Shen, and Jing Yang, NeurIPS 2021.

  21. (Almost) Free Incentivized Exploration from Decentralized Learning Agents [Code]
    Chengshuai Shi, Haifeng Xu, Wei Xiong, and Cong Shen, NeurIPS 2021.

  22. Distributional Reinforcement Learning for Multi-Dimensional Reward Functions
    Pushi Zhang, Xiaoyu Chen, Li Zhao, Wei Xiong, Tao Qin, and Tie-Yan Liu, NeurIPS 2021.

  23. Decentralized multi-player multi-armed bandits with no collision information [Code*]
    Chengshuai Shi, Wei Xiong, Cong Shen, and Jing Yang, AISTATS 2020.

(* equal contribution or alphabetical order; entries marked (α-β) list authors in alphabetical order)

The Code* link implements many state-of-the-art and baseline MPMAB algorithms; it is a nice piece of work by Cindy Trinh.
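
A minimal sketch of the multi-objective reward model with gating from paper 2, assuming k scalar reward attributes (e.g., helpfulness and safety heads). The class and parameter names below are illustrative placeholders, not the released model's API:

    import torch
    import torch.nn as nn

    class GatedRewardModel(nn.Module):
        """Sketch: k attribute-specific reward heads plus a gating network
        that mixes the attribute rewards into one scalar reward."""
        def __init__(self, hidden_dim: int, num_objectives: int):
            super().__init__()
            # One linear head per reward attribute.
            self.reward_heads = nn.Linear(hidden_dim, num_objectives)
            # Gating network: prompt representation -> mixture weights.
            self.gate = nn.Sequential(
                nn.Linear(hidden_dim, num_objectives),
                nn.Softmax(dim=-1),
            )

        def forward(self, prompt_emb: torch.Tensor, response_emb: torch.Tensor) -> torch.Tensor:
            rewards = self.reward_heads(response_emb)  # (batch, k) attribute rewards
            weights = self.gate(prompt_emb)            # (batch, k) prompt-dependent weights
            return (weights * rewards).sum(dim=-1)     # (batch,) aggregated scalar reward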
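
For paper 3, the KL-regularized contextual-bandit objective at the heart of the analysis, in standard notation (x a prompt, a a response, \pi_0 the reference policy, \eta the KL coefficient; the symbols are my labeling, not necessarily the paper's):

    \max_{\pi}\; \mathbb{E}_{x \sim d_0,\, a \sim \pi(\cdot \mid x)}\bigl[r(x, a)\bigr]
    \;-\; \eta\, \mathbb{E}_{x \sim d_0}\Bigl[\mathrm{KL}\bigl(\pi(\cdot \mid x) \,\big\|\, \pi_0(\cdot \mid x)\bigr)\Bigr]

Its maximizer is the Gibbs policy \pi^*(a \mid x) \propto \pi_0(a \mid x) \exp(r(x, a)/\eta), the closed form that DPO-style methods build on.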
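
A minimal sketch of one RAFT round from paper 7: best-of-n sampling scored by a reward model, then fine-tuning on the winners. Here generate, reward_fn, and fine_tune are hypothetical placeholders for a sampler, a reward model, and an SFT step, not the released code:

    def raft_round(model, prompts, generate, reward_fn, fine_tune, n=8):
        """One RAFT iteration (sketch): sample n responses per prompt, keep
        the highest-reward one, then fine-tune on the selected pairs."""
        selected = []
        for x in prompts:
            candidates = [generate(model, x) for _ in range(n)]  # best-of-n sampling
            scores = [reward_fn(x, y) for y in candidates]       # score with reward model
            selected.append((x, candidates[scores.index(max(scores))]))
        return fine_tune(model, selected)                        # SFT on (prompt, best response)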