Blog

I enjoy taking notes on my research and sharing them with others interested in these areas. Thanks to this habit, I have met many like-minded friends.

Large language models

  1. Alignment Guidebook.

    A comprehensive introduction to preference learning in LLMs.

  2. Why should we do online RLHF/DPO? (in Chinese).

    A blog post about the advantages of online exploration in RLHF.

  3. A tutorial on RLHF with LMFlow (Huggingface version).

Reinforcement learning theory

  1. An introduction to the eluder coefficient, a unified framework for nearly all known tractable decision-making problems.

  2. Note on reduction-based RL
    A short note on the eluder coefficient, the decoupling coefficient, and the decision-estimation coefficient.

  3. Note on non-linear contextual bandit
    An application of the eluder coefficient to contextual bandits with general function approximation.

Optimization basics and Decentralized Optimization

  1. A generic framework for first-order stochastic decentralized optimization

  2. Analysis of batch gradient descent and stochastic gradient descent (in Chinese)

Multi-armed bandit basics and Multi-player multi-armed bandit

  1. Bandits with stochastic rewards (in Chinese)

  2. Lower bound for stochastic bandit (in Chinese)

  3. Adversarial bandit (in Chinese)

  4. UCB2 (in Chinese)

  5. Introduction to multi-player MAB and the Synchronisation Involves Communication Algorithm

  6. Heterogeneous multi-player MAB

  7. Experiments on MAB
    Also check my research page for the GitHub repository.