Blog
I enjoy taking notes on my research and sharing them with others interested in the same areas. Through this habit, I have met many like-minded friends.
Large language model
Alignment Guidebook.
A comprehensive introduction to preference learning in LLMs.
Why should we do online RLHF/DPO? (in Chinese).
A blog about the advantage of online exploration in RLHF.
A tutorial on RLHF with LMFlow (Hugging Face version).
Reinforcement learning theory
An introduction to the eluder coefficient, a unified framework for nearly all known tractable decision-making problems.
Note on reduction-based RL
A short note on the eluder coefficient, the decoupling coefficient, and the decision-estimation coefficient.
Note on non-linear contextual bandit
An application of the eluder coefficient to contextual bandits with general function approximation.
Optimization basics and decentralized optimization
A generic framework for first-order stochastic decentralized optimization
Analysis of batch gradient descent and stochastic gradient descent (in Chinese)
Multi-armed bandit basics and multi-player multi-armed bandit
Bandits with stochastic rewards (in Chinese)
Lower bound for stochastic bandit (in Chinese)
Adversarial bandit (in Chinese)
UCB2 (in Chinese)
Introduction to multi-player MAB and the Synchronisation Involves Communication algorithm
Heterogeneous multi-player MAB
Experiments of MAB
Also check my research page for the GitHub repository.