Blog

I enjoy taking notes on my research and sharing them with others who are also interested in these areas. Thanks to this habit, I have met many like-minded friends who share similar interests.

Large language model

Alignment Guidebook.

A comprehensive introduction to preference learning in LLMs.
Why should we do online RLHF/DPO? (in Chinese).

A blog post about the advantages of online exploration in RLHF.
A tutorial on RLHF with LMFlow (Hugging Face version).

Reinforcement learning theory

An introduction to the eluder coefficient, a unified framework for nearly all known tractable decision-making problems.
Note on reduction-based RL
A short note on the eluder coefficient, the decoupling coefficient, and the decision-estimation coefficient.
Note on non-linear contextual bandit
An application of the eluder coefficient to contextual bandits with general function approximation.

Optimization basics and Decentralized Optimization

Multi-armed bandit basics and Multi-player multi-armed bandit

Bandits with stochastic rewards (in Chinese)
Lower bound for stochastic bandit (in Chinese)
Adversarial bandit (in Chinese)
UCB2 (in Chinese)
Introduction to multi-player MAB and the Synchronisation Involves Communication algorithm
Heterogeneous multi-player MAB
Experiments of MAB
Also check my research page for the GitHub repository.