Two-Step Offline Preference-Based
Reinforcement Learning with Constrained Actions

Yinglun Xu1,   Tarun Suresh1,   Rohan Gumaste1,   David Zhu1,   Ruirui Li2,   
Zhengyang Wang2,   Haoming Jiang2,   Xianfeng Tang2,   Qingyu Yin2,  
Monica Xiao Cheng2,   Qi Zeng1,   Chao Zhang3,   Gagandeep Singh1
1University of Illinois Urbana-Champaign, 2Amazon, 3Georgia institute of technology
Work done during an internship at Amazon. Corresponding author: yinglun6@illinois.edu
Abstract

Preference-based reinforcement learning (PBRL) in the offline setting has succeeded greatly in industrial applications such as chatbots. A two-step learning framework where one applies a reinforcement learning step after a reward modeling step has been widely adopted for the problem. However, such a method faces challenges from the risk of reward hacking and the complexity of reinforcement learning. To overcome the challenge, our insight is that both challenges come from the state-actions not supported in the dataset. Such state-actions are unreliable and increase the complexity of the reinforcement learning problem at the second step. Based on the insight, we develop a novel two-step learning method called PRC: preference-based reinforcement learning with constrained actions. The high-level idea is to limit the reinforcement learning agent to optimize over a constrained action space that excludes the out-of-distribution state-actions. We empirically verify that our method has high learning efficiency on various datasets in robotic control environments.

1 Introduction

Deep Reinforcement Learning (DRL) is a learning paradigm for solving sequential decision-making problems and has rich real-world applications (Sutton & Barto, 2018; Luong et al., 2019; Haydari & Yılmaz, 2020). However, traditional DRL approaches require numerical reward feedback during learning, which in practice can be hard to design or obtain. In contrast, non-numerical preference feedback is more accessible in most cases (Wirth & Fürnkranz, 2013). As a result, preference-based reinforcement learning (PBRL) that only requires preference feedback has become a more realistic learning paradigm (Christiano et al., 2017). Recently, as a special instance of PBRL, reinforcement learning from human feedback (RLHF) that utilizes human preference has drawn much attention and achieved great success in many NLP tasks (Ouyang et al., 2022). Although a majority of current PBRL works focus on the online learning setting (Ibarz et al., 2018; Lee et al., 2021), the PBRL learning in the offline setting is more realistic Rafailov et al. (2023). In the online learning setting, the learning agent continuously interacts with the learning environment to collect new preference data throughout the training process; In the offline learning setting, the agent receives a batch of preference data before training starts. Compared to the online learning setting, the offline learning setting is more accessible as it does not require customized feedback throughout the training process. It is critical to investigate how to make PBRL learning efficient and practical in the offline setting. Particularly, we are interested in the most popular learning framework that is widely applied in various PBRL literature and industrial applications, which we call ‘two-step learning.’

Two-step learning framework: In a typical two-step learning framework (Ibarz et al., 2018; Christiano et al., 2017), the first learning step is reward modeling, where the learning agent approximates the reward model of the environment that can best explain the observations from the training dataset. The second learning step is a reinforcement learning phase where the learning agent learns a policy based on the reward model acquired in the previous step. There are multiple reasons behind the vast popularity of the two-step learning paradigm: 1. similar to the motivation of inverse reinforcement learning (Arora & Doshi, 2021), the learned environment model provides a succinct and transferable definition of the task; 2. the method is modular in that each phase of learning can be implemented by existing well-developed methods. Despite its popularity, this method faces two main challenges.

  • Reward over-optimization/Pessimistic learning: Pessimistic learning has always been the key challenge in offline learning settings. Due to the distribution mismatch between the trajectories induced by the policy and that of the dataset, the agent cannot properly evaluate all possible possible policies.

  • Reinforcement learning complexity: Implementing Reinforcement learning efficiently is known to be challenging Dulac-Arnold et al. (2021). In addition, for preference feedback, the reward model is learned from indirect preference signals rather than actual true rewards. The learned reward model may look very different from the actual reward, making the training step more unstable.

The challenges are empirically verified in Section 5. To solve the first challenge, the most popular approach is to apply a penalty on the evaluated performance of a policy during the reinforcement learning step. The penalty is the KL divergence between a policy and the behavior policy that can best describe the training data distribution. Under the penalty, the agent is encouraged to stay close to the dataset distribution while trying to find a policy that optimizes the learned reward model. However, the reinforcement learning problem in this problem may not be easy to solve due to the second challenge, which is empirically verified in Section 5.

The approaches to solving the second challenge mainly focus on the bandit setting for NLP applications. The bandit setting is a special case of RL that has no state transition. Current methods include direct preference alignment Rafailov et al. (2023) and rejection sampling Liu et al. (2023), both of which cannot be directly applied to the general RL setting. To the best of our knowledge, no existing two-phase learning approach considers tackling both challenges simultaneously.

Our contributions. In this work, we propose a novel two-step learning PBRL algorithm PRC that tackles the above two main challenges simultaneously. Our key insight is that both challenges are induced by the same element: the state actions that are out of the dataset distribution. For pessimistic learning, the agent must avoid the policies that are more likely to visit such state-actions. These policies are not well covered by the dataset, and the agent cannot evaluate their performance accurately. Meanwhile, these out-of-distribution state-actions contribute to the complexity of the learning problem. Therefore, our insight is to constrain the action space to exclude all such out-of-distribution state-actions. Based on the insight, the key step in our approach is to constrain the action space to include only the actions with a high probability of being sampled by a behavior clone policy that represents the dataset action distribution. In other words, our method only focuses on the policies that are in the neighborhood of the behavior clone policy. As a result, our method not only inherits the pessimistic idea of behavior regularization but also reduces the complexity of the corresponding reinforcement learning problem. In the experiments, we construct offline PBRL datasets based on standard offline RL benchmark D4RL Fu et al. (2020). The details of dataset construction can be found in Section 5. We empirically verify that on the offline datasets we construct, our approach can always find policies of higher performance than the behavior policies of the dataset, suggesting that the improvement in our PBRL training is reliable.

2 Related Work

Offline Reinforcement learning: There have been many studies on the standard offline reinforcement learning setting with reward feedback (Levine et al., 2020; Cheng et al., 2022; Yu et al., 2021), in contrast to preference feedback considered in our setting. The key idea behind offline reinforcement learning is pessimism which encourages the learning agent to focus on the policies that are supported by the dataset (Cheng et al., 2022). However, the gap between learning from reward feedback and preference feedback is not clear at present.

Online PBRL: Preference-based reinforcement learning is popular in the online learning scenario. Various types of preference models and preference labels have been considered in the tabular case (Fürnkranz et al., 2012; Wirth et al., 2017). Similar to the works that consider the general case, in our work we consider the preference models to be Bradley-Terry models and output binary preference labels (Ibarz et al., 2018; Lee et al., 2021). The two-phase learning approach mentioned in Section 1 is prevalent in the online PBRL setting. Note that directly applying online methods to the offline scenario is also likely to be inefficient even when it is possible (Van Hasselt et al., 2018), as the online learning approaches do not need to be pessimistic.

Offline PBRL: Offline preference-based reinforcement learning is a relatively new topic. Zhan et al. (2023) propose an optimization problem following the two-phase learning framework and theoretically prove that the solution to the problem is a near-optimal policy with respect to the training dataset. However, how to solve the optimization problem in practice remains unknown. Concurrently, Zhu et al. (2023) formulates a similar optimization problem under the linear assumption on the environment and proposes a way to solve the problem, but the results cannot be extended to the general case without the linear assumption. Kim et al. (2023) study methods for learning the utility models based on preference provided by humans. They propose to use a transformer-based architecture to predict the utility function behind the preference model. Rafailov et al. (2023) propose a method that directly learns a policy without learning a utility model explicitly. Unlike the setting considered in our work, they focus on the NLP task and require trajectories in each pair to have the same prompt (initial state). Hejna & Sadigh (2023) propose a method that learns a policy and the Q function of the environment instead of the utility function. They also require access to an additional trajectory dataset that only contains trajectories to make their algorithm efficient. Our work focuses on developing a practical and efficient two-phase learning approach for the offline PBRL setting that only has access to a preference dataset with no requirement on the trajectories in the dataset.

RLHF: Reinforcement learning from human feedback is a special instance of PBRL that its preference labels are provided by humans. It has been a popular topic recently. Success in various robotic control and NLP taasks have been achieved by fine-tuning pre-trained models through reinforcement learning based on preference feedback from human (Ouyang et al., 2022; Bai et al., 2022). These studies focus on solving specific tasks instead of the general problem considered in our work.

3 Preliminary

3.1 Offline Preference Reinforcement Learning Problem

We consider an offline preference-based reinforcement learning setting (Zhan et al., 2023). An environment is characterized by an incomplete Markov Decision Process (MDP) without a reward function =(𝒮,𝒜,𝒫)𝒮𝒜𝒫\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P})caligraphic_M = ( caligraphic_S , caligraphic_A , caligraphic_P ) where 𝒮𝒮\mathcal{S}caligraphic_S is the state space, 𝒜𝒜\mathcal{A}caligraphic_A is the action space, and 𝒫𝒫\mathcal{P}caligraphic_P is the state transition function. A policy π:𝒮Δ(𝒜):𝜋𝒮Δ𝒜\pi:\mathcal{S}\rightarrow\Delta(\mathcal{A})italic_π : caligraphic_S → roman_Δ ( caligraphic_A ) is a mapping from a state to a probability distribution over the action space, representing a way to behave in the environment by taking the action at a state. A deterministic policy is a special kind of policy that maps a state to a single action. We use Π𝒮,𝒜subscriptΠ𝒮𝒜\Pi_{\mathcal{S},\mathcal{A}}roman_Π start_POSTSUBSCRIPT caligraphic_S , caligraphic_A end_POSTSUBSCRIPT to denote the class of all policies, and Π𝒮,𝒜DsubscriptsuperscriptΠ𝐷𝒮𝒜\Pi^{D}_{\mathcal{S},\mathcal{A}}roman_Π start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_S , caligraphic_A end_POSTSUBSCRIPT to denote the class of deterministic policies. A trajectory of length t𝑡titalic_t is a sequence of states and actions (s1,a1,,st,at,st+1)s_{1},a_{1},\ldots,s_{t},a_{t},s_{t+1})italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ) where aiA,si+1P(|si,ai)i[t]a_{i}\in A,s_{i+1}\sim P(\cdot|s_{i},a_{i})\forall i\in[t]italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_A , italic_s start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT ∼ italic_P ( ⋅ | italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∀ italic_i ∈ [ italic_t ]. A preference model F𝐹Fitalic_F takes a pair of trajectories τ1,τ2subscript𝜏1subscript𝜏2\tau_{1},\tau_{2}italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT as input, and outputs a preference σ{,}𝜎succeedsprecedes\sigma\in\{\succ,\prec\}italic_σ ∈ { ≻ , ≺ }, indicating its preference over the two trajectories, i.e. τ1τ2succeedssubscript𝜏1subscript𝜏2\tau_{1}\succ\tau_{2}italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ≻ italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT or τ1τ2precedessubscript𝜏1subscript𝜏2\tau_{1}\prec\tau_{2}italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ≺ italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. In this work, following the conventions, we consider the preference models that are Bradley-Terry (BT) models (Bradley & Terry, 1952). Specifically, there exists an utility function u:𝒮×𝒜:𝑢𝒮𝒜u:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}italic_u : caligraphic_S × caligraphic_A → blackboard_R, such that the probability distribution of the output preference of F𝐹Fitalic_F given τ1,τ2subscript𝜏1subscript𝜏2\tau_{1},\tau_{2}italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT satisfies:

Pr{τ1τ2}=exp((s,a)τ1u(s,a))exp((s,a)τ1u(s,a))+exp((s,a)τ2u(s,a))Prsucceedssubscript𝜏1subscript𝜏2subscript𝑠𝑎subscript𝜏1𝑢𝑠𝑎subscript𝑠𝑎subscript𝜏1𝑢𝑠𝑎subscript𝑠𝑎subscript𝜏2𝑢𝑠𝑎\Pr\{\tau_{1}\succ\tau_{2}\}=\frac{\exp(\sum_{(s,a)\in\tau_{1}}u(s,a))}{\exp(% \sum_{(s,a)\in\tau_{1}}u(s,a))+\exp(\sum_{(s,a)\in\tau_{2}}u(s,a))}roman_Pr { italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ≻ italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT } = divide start_ARG roman_exp ( ∑ start_POSTSUBSCRIPT ( italic_s , italic_a ) ∈ italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_u ( italic_s , italic_a ) ) end_ARG start_ARG roman_exp ( ∑ start_POSTSUBSCRIPT ( italic_s , italic_a ) ∈ italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_u ( italic_s , italic_a ) ) + roman_exp ( ∑ start_POSTSUBSCRIPT ( italic_s , italic_a ) ∈ italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_u ( italic_s , italic_a ) ) end_ARG (1)

We define the performance of a policy π𝜋\piitalic_π on a utility function u𝑢uitalic_u as the expected cumulative utility of a trajectory generated by the policy: 𝔼τ(π,𝒫)[s,aτu(s,a)]subscript𝔼similar-to𝜏𝜋𝒫delimited-[]subscript𝑠𝑎𝜏𝑢𝑠𝑎\mathbb{E}_{\tau\sim(\pi,\mathcal{P})}[\sum_{s,a\in\tau}u(s,a)]blackboard_E start_POSTSUBSCRIPT italic_τ ∼ ( italic_π , caligraphic_P ) end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_s , italic_a ∈ italic_τ end_POSTSUBSCRIPT italic_u ( italic_s , italic_a ) ]. An offline dataset D={(τ11,τ21,σ1),,(τ1N,τ2N,σN)}𝐷subscriptsuperscript𝜏11subscriptsuperscript𝜏12superscript𝜎1subscriptsuperscript𝜏𝑁1subscriptsuperscript𝜏𝑁2superscript𝜎𝑁D=\{(\tau^{1}_{1},\tau^{1}_{2},\sigma^{1}),\ldots,(\tau^{N}_{1},\tau^{N}_{2},% \sigma^{N})\}italic_D = { ( italic_τ start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_τ start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_σ start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ) , … , ( italic_τ start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_τ start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_σ start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ) } consists of multiple preference data over different trajectory pairs. We use the term ‘behavior policy’ to represent how the trajectories are generated in the dataset. A learning agent is given access to the incomplete MDP =(𝒮,𝒜,𝒫)𝒮𝒜𝒫\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P})caligraphic_M = ( caligraphic_S , caligraphic_A , caligraphic_P ) of the environment and a dataset D𝐷Ditalic_D whose preferences are generated by a preference model F𝐹Fitalic_F. The learning goal is to learn a policy that has high performance on the utility function u𝑢uitalic_u of the preference model F𝐹Fitalic_F, and we say a learning algorithm is efficient if it can learn such a high-performing policy.

3.2 Traditional PPO Learning with KL-regularization

Here, we introduce the prevalent two-step learning framework that has been widely applied for solving PBRL problems Christiano et al. (2017) and is closely related to our algorithm in this work. In general, the algorithm aims at solving the optimization problem below:

argmaxπΠ𝒮,𝒜𝔼τ(π,𝒫)(s,a)τ[u^(s,a)]αKL(π,πb)subscriptargmax𝜋subscriptΠ𝒮𝒜subscript𝔼similar-to𝜏𝜋𝒫subscript𝑠𝑎𝜏delimited-[]^𝑢𝑠𝑎𝛼KL𝜋subscript𝜋𝑏\displaystyle\operatorname*{arg\,max}_{\pi\in\Pi_{\mathcal{S},\mathcal{A}}}% \mathbb{E}_{\tau\sim(\pi,\mathcal{P})}\sum_{(s,a)\in\tau}[\hat{u}(s,a)]-\alpha% \cdot\text{KL}(\pi,\pi_{b})start_OPERATOR roman_arg roman_max end_OPERATOR start_POSTSUBSCRIPT italic_π ∈ roman_Π start_POSTSUBSCRIPT caligraphic_S , caligraphic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_τ ∼ ( italic_π , caligraphic_P ) end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT ( italic_s , italic_a ) ∈ italic_τ end_POSTSUBSCRIPT [ over^ start_ARG italic_u end_ARG ( italic_s , italic_a ) ] - italic_α ⋅ KL ( italic_π , italic_π start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ) (2)
s.t. u^=argminuu(u,D)^𝑢subscript𝑢subscript𝑢𝑢𝐷\displaystyle\hat{u}=\arg\min_{u}\mathcal{L}_{u}(u,D)over^ start_ARG italic_u end_ARG = roman_arg roman_min start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT ( italic_u , italic_D )
πb=argminπΠbπ(π,D)subscript𝜋𝑏subscript𝜋subscriptΠ𝑏subscript𝜋𝜋𝐷\displaystyle\pi_{b}=\arg\min_{\pi\in\Pi_{b}}\mathcal{L}_{\pi}(\pi,D)italic_π start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT = roman_arg roman_min start_POSTSUBSCRIPT italic_π ∈ roman_Π start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT italic_π end_POSTSUBSCRIPT ( italic_π , italic_D )

Here, α𝛼\alphaitalic_α is a parameter to control the degree of pessimism. u^^𝑢\hat{u}over^ start_ARG italic_u end_ARG is a learned reward model and πbsubscript𝜋𝑏\pi_{b}italic_π start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT is a behavior imitation policy. u(u,D)subscript𝑢𝑢𝐷\mathcal{L}_{u}(u,D)caligraphic_L start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT ( italic_u , italic_D ) is a loss function to evaluate the quality of the utility model on interpreting the preference labels in the dataset. π(u,D)subscript𝜋𝑢𝐷\mathcal{L}_{\pi}(u,D)caligraphic_L start_POSTSUBSCRIPT italic_π end_POSTSUBSCRIPT ( italic_u , italic_D ) is a loss function to evaluate how well the behavior imitation policy reproduces the behavior demonstrated by the dataset.

To solve the optimization problem, the first step is to find the optimal reward model and behavior imitation policy that minimize their corresponding losses. The second step is to use a reinforcement learning algorithm such as PPO to solve for the policy that maximizes the optimization goal. Pessimism is achieved through the KL regularization in the optimization goal, which encourages the policy not to be very different from the behavior imitation policy. Formally, Algorithm 1 below represents a typical way to solve the optimization problem in Eq 2.

Inputs : Environment (𝒮,𝒜,𝒫)𝒮𝒜𝒫(\mathcal{S},\mathcal{A},\mathcal{P})( caligraphic_S , caligraphic_A , caligraphic_P ), Dataset D𝐷Ditalic_D
1. Train a utility function u^^𝑢\hat{u}over^ start_ARG italic_u end_ARG on D𝐷Ditalic_D that minimize u(,D)subscript𝑢𝐷\mathcal{L}_{u}(\cdot,D)caligraphic_L start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT ( ⋅ , italic_D ) to approximate the preference model in D𝐷Ditalic_D.
2. Train a policy πbsubscript𝜋𝑏\pi_{b}italic_π start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT from a class of policy ΠbsubscriptΠ𝑏\Pi_{b}roman_Π start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT that minimizes π(π,D)subscript𝜋𝜋𝐷\mathcal{L}_{\pi}(\pi,D)caligraphic_L start_POSTSUBSCRIPT italic_π end_POSTSUBSCRIPT ( italic_π , italic_D ) to approximate the behavior policy behind D𝐷Ditalic_D.
3. Find the policy πsuperscript𝜋\pi^{*}italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT that optimizes π=argmaxπΠ𝒮,𝒜𝔼τ(π,P)[(s,a)τu^(s,a)]αKL(π,πb)superscript𝜋subscript𝜋subscriptΠ𝒮𝒜subscript𝔼similar-to𝜏𝜋𝑃delimited-[]subscript𝑠𝑎𝜏^𝑢𝑠𝑎𝛼KL𝜋subscript𝜋𝑏\pi^{*}=\arg\max_{\pi\in\Pi_{\mathcal{S},\mathcal{A}}}\mathbb{E}_{\tau\sim(\pi% ,P)}[\sum_{(s,a)\in\tau}\hat{u}(s,a)]-\alpha\cdot\text{KL}(\pi,\pi_{b})italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = roman_arg roman_max start_POSTSUBSCRIPT italic_π ∈ roman_Π start_POSTSUBSCRIPT caligraphic_S , caligraphic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_τ ∼ ( italic_π , italic_P ) end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT ( italic_s , italic_a ) ∈ italic_τ end_POSTSUBSCRIPT over^ start_ARG italic_u end_ARG ( italic_s , italic_a ) ] - italic_α ⋅ KL ( italic_π , italic_π start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT )
4. Output πsuperscript𝜋\pi^{*}italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT
Algorithm 1 Typical two-step training with KL-regularization

4 Preference Based Reinforcement Learning on a Constrained Action Space (PRC)

4.1 General PRC Algorithm

First, we introduce our main algorithm PRC. In Alg 2, we formally show the general form of PRC.

Inputs : Environment (𝒮,𝒜,𝒫)𝒮𝒜𝒫(\mathcal{S},\mathcal{A},\mathcal{P})( caligraphic_S , caligraphic_A , caligraphic_P ), Dataset D𝐷Ditalic_D
1. Train a utility function u^^𝑢\hat{u}over^ start_ARG italic_u end_ARG on D𝐷Ditalic_D that minimize u(,D)subscript𝑢𝐷\mathcal{L}_{u}(\cdot,D)caligraphic_L start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT ( ⋅ , italic_D ) to approximate the preference model in D𝐷Ditalic_D.
2. Train a policy πbsubscript𝜋𝑏\pi_{b}italic_π start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT from a class of policy ΠbsubscriptΠ𝑏\Pi_{b}roman_Π start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT that minimizes π(π,D)subscript𝜋𝜋𝐷\mathcal{L}_{\pi}(\pi,D)caligraphic_L start_POSTSUBSCRIPT italic_π end_POSTSUBSCRIPT ( italic_π , italic_D ) to approximate the behavior policy behind D𝐷Ditalic_D.
3. Construct a clipped action space 𝒜superscript𝒜\mathcal{A}^{\prime}caligraphic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT such that at a state s𝑠sitalic_s, the clipped action space is 𝒜|s={a:π0(a|s)p}conditionalsuperscript𝒜𝑠conditional-set𝑎subscript𝜋0conditional𝑎𝑠𝑝\mathcal{A}^{\prime}|s=\{a:\pi_{0}(a|s)\geq p\}caligraphic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | italic_s = { italic_a : italic_π start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_a | italic_s ) ≥ italic_p }.
4. Find the policy πsuperscript𝜋\pi^{*}italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT supported on the clipped action space that optimizes π=argmaxπΠ𝒮,𝒜𝔼τ(π,P)[(s,a)τu^(s,a)]superscript𝜋subscript𝜋subscriptΠ𝒮superscript𝒜subscript𝔼similar-to𝜏𝜋𝑃delimited-[]subscript𝑠𝑎𝜏^𝑢𝑠𝑎\pi^{*}=\arg\max_{\pi\in\Pi_{\mathcal{S},\mathcal{A}^{\prime}}}\mathbb{E}_{% \tau\sim(\pi,P)}[\sum_{(s,a)\in\tau}\hat{u}(s,a)]italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = roman_arg roman_max start_POSTSUBSCRIPT italic_π ∈ roman_Π start_POSTSUBSCRIPT caligraphic_S , caligraphic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_τ ∼ ( italic_π , italic_P ) end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT ( italic_s , italic_a ) ∈ italic_τ end_POSTSUBSCRIPT over^ start_ARG italic_u end_ARG ( italic_s , italic_a ) ]
4. Output πsuperscript𝜋\pi^{*}italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT
Algorithm 2 Preference Based Reinforcement Learning on a Constrained Action Space

Here, we abuse the notation of π(a|s)𝜋conditional𝑎𝑠\pi(a|s)italic_π ( italic_a | italic_s ). If the action space is discrete, then π(a|s)𝜋conditional𝑎𝑠\pi(a|s)italic_π ( italic_a | italic_s ) represents the probability of the policy to generate the response a𝑎aitalic_a at the state s𝑠sitalic_s. If the action space is continuous, the π(a|s)𝜋conditional𝑎𝑠\pi(a|s)italic_π ( italic_a | italic_s ) becomes the probability density. In the first step, the learning agent learns a behavior policy that can best imitate the behavior demonstrated in the dataset and then learns a reward model that best interprets the dataset’s preference label. This is the same as the first step in the typical two-step learning framework Alg 1. In the second learning step, the agent learns the policy supported on a clipped action space that maximizes the cumulative return on the learned reward model. At this step, the learning agent only considers the high probability actions according to the behavior policy.

Formally, let the incomplete MDP of the environment be (𝒮,𝒜,𝒫)𝒮𝒜𝒫(\mathcal{S},\mathcal{A},\mathcal{P})( caligraphic_S , caligraphic_A , caligraphic_P ). Let the behavior policy and reward model learned in the first step be π0subscript𝜋0\pi_{0}italic_π start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and \mathcal{R}caligraphic_R respectively. We define a constrained action space 𝒜superscript𝒜\mathcal{A}^{\prime}caligraphic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT based 𝒜𝒜\mathcal{A}caligraphic_A and π0subscript𝜋0\pi_{0}italic_π start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT. Let 𝒜|sconditionalsuperscript𝒜𝑠\mathcal{A}^{\prime}|scaligraphic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | italic_s be the set of actions 𝒜superscript𝒜\mathcal{A}^{\prime}caligraphic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT at a state s𝑠sitalic_s. The constrained action space 𝒜superscript𝒜\mathcal{A}^{\prime}caligraphic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT consists of all actions that have a probability higher than a threshold to be sampled from the behavior policy at a state s𝑠sitalic_s: 𝒜|s={a:π0(a|s)p}conditionalsuperscript𝒜𝑠conditional-set𝑎subscript𝜋0conditional𝑎𝑠𝑝\mathcal{A}^{\prime}|s=\{a:\pi_{0}(a|s)\geq p\}caligraphic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | italic_s = { italic_a : italic_π start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_a | italic_s ) ≥ italic_p }. Then, at the second learning step, the agent solves an RL problem where the MDP is (𝒮,𝒜,𝒫,)𝒮superscript𝒜𝒫(\mathcal{S},\mathcal{A}^{\prime},\mathcal{P},\mathcal{R})( caligraphic_S , caligraphic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , caligraphic_P , caligraphic_R ).

4.2 Analysis

Here, we show that the reinforcement learning step in PRC is essentially optimizing the reward function under a special behavior regularization constraint. Recap the prevalent PBRL framework in Alg 1. The pessimism in the algorithm is achieved by using KL regularization, which encourages the policy to be not very different from the behavior policy. Here, we consider a different regularization as below.

Cp(π,πb)={if (s,a)(𝒮,𝒜),π(a|s)>0,πb(a|s)<p0,otherwisesubscript𝐶𝑝𝜋subscript𝜋𝑏casesformulae-sequenceif 𝑠𝑎𝒮𝒜formulae-sequence𝜋conditional𝑎𝑠0subscript𝜋𝑏conditional𝑎𝑠𝑝0otherwiseC_{p}(\pi,\pi_{b})=\begin{cases}-\infty&\text{if }\exists(s,a)\in(\mathcal{S},% \mathcal{A}),\pi(a|s)>0,\pi_{b}(a|s)<p\\ 0,&\text{otherwise}\end{cases}italic_C start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ( italic_π , italic_π start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ) = { start_ROW start_CELL - ∞ end_CELL start_CELL if ∃ ( italic_s , italic_a ) ∈ ( caligraphic_S , caligraphic_A ) , italic_π ( italic_a | italic_s ) > 0 , italic_π start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ( italic_a | italic_s ) < italic_p end_CELL end_ROW start_ROW start_CELL 0 , end_CELL start_CELL otherwise end_CELL end_ROW

The optimization problem under such a regularization is argmaxπΠ𝒮,𝒜𝔼τ(π,𝒫)(s,a)τ[u^(s,a)]αCp(π,πb)subscriptargmax𝜋subscriptΠ𝒮𝒜subscript𝔼similar-to𝜏𝜋𝒫subscript𝑠𝑎𝜏delimited-[]^𝑢𝑠𝑎𝛼subscript𝐶𝑝𝜋subscript𝜋𝑏\operatorname*{arg\,max}_{\pi\in\Pi_{\mathcal{S},\mathcal{A}}}\mathbb{E}_{\tau% \sim(\pi,\mathcal{P})}\sum_{(s,a)\in\tau}[\hat{u}(s,a)]-\alpha\cdot C_{p}(\pi,% \pi_{b})start_OPERATOR roman_arg roman_max end_OPERATOR start_POSTSUBSCRIPT italic_π ∈ roman_Π start_POSTSUBSCRIPT caligraphic_S , caligraphic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_τ ∼ ( italic_π , caligraphic_P ) end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT ( italic_s , italic_a ) ∈ italic_τ end_POSTSUBSCRIPT [ over^ start_ARG italic_u end_ARG ( italic_s , italic_a ) ] - italic_α ⋅ italic_C start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ( italic_π , italic_π start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ). Under such a regularization, the optimal policy can only be supported in a constrained action space where the probability density of the behavior policy is greater than the threshold p𝑝pitalic_p, as otherwise, it will suffer from an infinite penalty from the regularization.

Next, we discuss why choosing such a regularization is reasonable. First, unlike the soft constraint, such as KL regularization, our regularization is a hard constraint that forces the policy to stay close to the behavior policy. It is conceptually more conservative in that it explicitly decreases the possibility of reward hacking by outputting some policy that is not close to the behavior policy. Second, it reduces the complexity of finding the optimal policy under the regularization. This can make the reinforcement learning step more accessible and reduce the optimization loss at this step.

4.3 Practical Implementation

Here, we introduce a practical implementation of our PRC algorithm, which is later used in our experiments. First, we can train a deterministic policy π0subscript𝜋0\pi_{0}italic_π start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT that has a high probability of reproducing the behavior demonstrated by the dataset. The action given by the deterministic policy can be thought of as the center of the distribution of the actual underlying behavior policy of the dataset. Since we have no prior knowledge about the shape of the distribution of the behavior policy, we may only infer that an action is more likely to be sampled by the behavior policy if it is closer to the center of its distribution. Then, we can approximate the space of 𝒜superscript𝒜\mathcal{A}^{\prime}caligraphic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT as a box whose center is at the deterministic policy. The radius of the action space is a hyper-parameter related to the degree of pessimism.

Formally, Alg 3 is a practical implementation for PRC algorithm. Lines 1-2 learn a utility model and a behavior clone policy, which a standard supervised learning approach can achieve. The loss functions are set in Eq 3 below. In section 5, we empirically verify that training on a constrained action space is an efficient pessimistic learning approach, and the reinforcement learning is also much easier on a constrained action space.

u(u,D)subscript𝑢𝑢𝐷\displaystyle\mathcal{L}_{u}(u,D)caligraphic_L start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT ( italic_u , italic_D ) =(τ1,τ2,σ)Dlogexp((s,a)τu(s,a))exp((s,a)τ1u(s,a))+exp((s,a)τ2u(s,a))absentsubscriptsubscript𝜏1subscript𝜏2𝜎𝐷subscript𝑠𝑎superscript𝜏𝑢𝑠𝑎subscript𝑠𝑎subscript𝜏1𝑢𝑠𝑎subscript𝑠𝑎subscript𝜏2𝑢𝑠𝑎\displaystyle=-\sum_{(\tau_{1},\tau_{2},\sigma)\in D}\log\frac{\exp(\sum_{(s,a% )\in\tau^{*}}u(s,a))}{\exp(\sum_{(s,a)\in\tau_{1}}u(s,a))+\exp(\sum_{(s,a)\in% \tau_{2}}u(s,a))}= - ∑ start_POSTSUBSCRIPT ( italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_σ ) ∈ italic_D end_POSTSUBSCRIPT roman_log divide start_ARG roman_exp ( ∑ start_POSTSUBSCRIPT ( italic_s , italic_a ) ∈ italic_τ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_u ( italic_s , italic_a ) ) end_ARG start_ARG roman_exp ( ∑ start_POSTSUBSCRIPT ( italic_s , italic_a ) ∈ italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_u ( italic_s , italic_a ) ) + roman_exp ( ∑ start_POSTSUBSCRIPT ( italic_s , italic_a ) ∈ italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_u ( italic_s , italic_a ) ) end_ARG (3)
π(π,D)subscript𝜋𝜋𝐷\displaystyle\mathcal{L}_{\pi}(\pi,D)caligraphic_L start_POSTSUBSCRIPT italic_π end_POSTSUBSCRIPT ( italic_π , italic_D ) =(s,a)Dπ(s)a2absentsubscriptsimilar-to𝑠𝑎𝐷subscriptnorm𝜋𝑠𝑎2\displaystyle=\sum_{(s,a)\sim D}\|\pi(s)-a\|_{2}= ∑ start_POSTSUBSCRIPT ( italic_s , italic_a ) ∼ italic_D end_POSTSUBSCRIPT ∥ italic_π ( italic_s ) - italic_a ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT

Lines 3-5 implement an RL environment with a constrained action space. For a policy πclipsubscript𝜋clip\pi_{\text{clip}}italic_π start_POSTSUBSCRIPT clip end_POSTSUBSCRIPT supported on 𝒮𝒮\mathcal{S}caligraphic_S, 𝒜superscript𝒜\mathcal{A}^{\prime}caligraphic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, if it outputs an action a𝒜superscript𝑎superscript𝒜a^{\prime}\in\mathcal{A}^{\prime}italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, then its corresponding actual policy outputs action f(s,a)𝑓𝑠superscript𝑎f(s,a^{\prime})italic_f ( italic_s , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ), and the state transition follows 𝒫(s,f(s,a))=𝒫(s,a)𝒫𝑠𝑓𝑠superscript𝑎superscript𝒫𝑠superscript𝑎\mathcal{P}(s,f(s,a^{\prime}))=\mathcal{P}^{\prime}(s,a^{\prime})caligraphic_P ( italic_s , italic_f ( italic_s , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ) = caligraphic_P start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_s , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ), where 𝒫(s,a)superscript𝒫𝑠superscript𝑎\mathcal{P}^{\prime}(s,a^{\prime})caligraphic_P start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_s , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) is the state transition in Line 5. Therefore, we can search for the optimal clipped policy πsuperscript𝜋\pi^{*}italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT from Π𝒮,𝒜subscriptΠ𝒮superscript𝒜\Pi_{\mathcal{S},\mathcal{A}^{\prime}}roman_Π start_POSTSUBSCRIPT caligraphic_S , caligraphic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT on the MDP with state space 𝒮𝒮\mathcal{S}caligraphic_S, action space 𝒜superscript𝒜\mathcal{A}^{\prime}caligraphic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, state transition 𝒫superscript𝒫\mathcal{P}^{\prime}caligraphic_P start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, and output the corresponding actual policy of πsuperscript𝜋\pi^{*}italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT. In practice, we can solve this by applying the SAC or PPO algorithm on the MDP =(𝒮,𝒜,𝒫,u^)superscript𝒮superscript𝒜superscript𝒫^𝑢\mathcal{M}^{\prime}=(\mathcal{S},\mathcal{A}^{\prime},\mathcal{P}^{\prime},% \hat{u})caligraphic_M start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = ( caligraphic_S , caligraphic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , caligraphic_P start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , over^ start_ARG italic_u end_ARG ).

Inputs : Environment (𝒮,𝒜,𝒫)𝒮𝒜𝒫(\mathcal{S},\mathcal{A},\mathcal{P})( caligraphic_S , caligraphic_A , caligraphic_P ), Dataset D𝐷Ditalic_D
Parameters : Positive real number r𝑟ritalic_r
1. Train a utility function u^^𝑢\hat{u}over^ start_ARG italic_u end_ARG that minimizes u(,D)subscript𝑢𝐷\mathcal{L}_{u}(\cdot,D)caligraphic_L start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT ( ⋅ , italic_D ) to approximate the preference model in D𝐷Ditalic_D.
2. Train a deterministic behavior clone policy πb0superscriptsubscript𝜋𝑏0\pi_{b}^{0}italic_π start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT that minimizes π(,D)subscript𝜋𝐷\mathcal{L}_{\pi}(\cdot,D)caligraphic_L start_POSTSUBSCRIPT italic_π end_POSTSUBSCRIPT ( ⋅ , italic_D ).
3. Construct a constrained action space 𝒜=Nsuperscript𝒜superscript𝑁\mathcal{A}^{\prime}=\mathbb{R}^{N}caligraphic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = blackboard_R start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT where N𝑁Nitalic_N is the dimensions of 𝒜𝒜\mathcal{A}caligraphic_A. Each dimension is constrained on [r,r]𝑟𝑟[-r,r][ - italic_r , italic_r ].
4. Construct a mapping f:𝒮×𝒜𝒜:𝑓𝒮superscript𝒜𝒜f:\mathcal{S}\times\mathcal{A}^{\prime}\rightarrow\mathcal{A}italic_f : caligraphic_S × caligraphic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT → caligraphic_A as f(s,a)=Proj𝒜(πb0(s)+a)𝑓𝑠superscript𝑎subscriptProj𝒜superscriptsubscript𝜋𝑏0𝑠superscript𝑎f(s,a^{\prime})=\text{Proj}_{\mathcal{A}}(\pi_{b}^{0}(s)+a^{\prime})italic_f ( italic_s , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) = Proj start_POSTSUBSCRIPT caligraphic_A end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ( italic_s ) + italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT )
5. Construct a state transition 𝒫:𝒮×𝒜Δ(𝒮):superscript𝒫𝒮superscript𝒜Δ𝒮\mathcal{P}^{\prime}:\mathcal{S}\times\mathcal{A}^{\prime}\rightarrow\Delta(% \mathcal{S})caligraphic_P start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT : caligraphic_S × caligraphic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT → roman_Δ ( caligraphic_S ) as 𝒫(s,a)=𝒫(s,f(s,a))superscript𝒫𝑠superscript𝑎𝒫𝑠𝑓𝑠superscript𝑎\mathcal{P}^{\prime}(s,a^{\prime})=\mathcal{P}(s,f(s,a^{\prime}))caligraphic_P start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_s , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) = caligraphic_P ( italic_s , italic_f ( italic_s , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) )
6. Find the optimal policy πsuperscript𝜋\pi^{*}italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT that optimizes argmaxπΠ𝒮,𝒜𝔼τ(π,𝒫)[(s,a)τu^(s,a)]subscript𝜋subscriptΠ𝒮superscript𝒜subscript𝔼similar-to𝜏𝜋superscript𝒫delimited-[]subscript𝑠𝑎𝜏^𝑢𝑠𝑎\arg\max_{\pi\in\Pi_{\mathcal{S},\mathcal{A}^{\prime}}}\mathbb{E}_{\tau\sim(% \pi,\mathcal{P}^{\prime})}[\sum_{(s,a)\in\tau}\hat{u}(s,a)]roman_arg roman_max start_POSTSUBSCRIPT italic_π ∈ roman_Π start_POSTSUBSCRIPT caligraphic_S , caligraphic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_τ ∼ ( italic_π , caligraphic_P start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT ( italic_s , italic_a ) ∈ italic_τ end_POSTSUBSCRIPT over^ start_ARG italic_u end_ARG ( italic_s , italic_a ) ]
7. Output π:π(f(s,a)|s)=π(a|s):𝜋𝜋conditional𝑓𝑠𝑎𝑠superscript𝜋conditional𝑎𝑠\pi:\pi(f(s,a)|s)=\pi^{*}(a|s)italic_π : italic_π ( italic_f ( italic_s , italic_a ) | italic_s ) = italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_a | italic_s )
Algorithm 3 PRC practical implementation

5 Experiments

5.1 Setup

Dataset: Following previous studies Hejna & Sadigh (2023), we construct our offline preference dataset from D4RL benchmark Fu et al. (2020), and generate synthetic preference following the standard techniques in previous PBRL studies (Kim et al., 2023; Christiano et al., 2017). Based on the definition of a standard offline preference-based reinforcement learning setting in Section 3, given a reward-based dataset from D4RL, we construct a preference-based dataset through the following process:

  1. 1.

    Randomly sample pairs of trajectory clips from the D4RL dataset. Following previous studies (Christiano et al., 2017), the length of the clip is set to be 20202020 steps.

  2. 2.

    For each pair of trajectory clips, compute the probability of a trajectory to be preferred based on the reward signals. To ensure consistency between different datasets, the reward signals are regularized to be bound in [1,1]11[-1,1][ - 1 , 1 ].

  3. 3.

    For each pair of trajectory clips, randomly generate a preference label through a Bernoulli trial with the probability computed above.

  4. 4.

    Return the preference dataset consisting of the trajectory clip pairs and the corresponding preference labels.

We consider various datasets from D4RL to represent the general learning scenarios. The robot control environments we choose include ‘Hopper’, ‘HalfCheetah’, and ‘Walker’. The types of trajectories we choose include ‘Medium’, ‘Medium-Replay’, and ‘Medium-Expert’. To make the number of state-action in the dataset aligned with that of the D4RL benchmark, the total number of trajectory pairs with a preference label is 1×1061superscript1061\times 10^{6}1 × 10 start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT.

Baseline Algorithms: Here, we consider multiple two-step learning algorithms as baselines and a state-of-the-art attack in the related offline PBRL setting, which are listed below:

  1. 1.

    Two-step learning with KL-regularization: We adopt the traditional two-step learning baseline with KL-regularization as introduced in Section 3. To make it a stronger baseline and the comparison fairer, we allow the algorithm to initialize from the behavior clone policy.

  2. 2.

    Naive two-step learning: To show a naive baseline where pessimistic learning is entirely absent, we adopt a naive two-step learning baseline, which is the same as the traditional two-step learning above except that there is no regularization at all during the reinforcement learning step. Note that we also allow the algorithm to initialize from the behavior clone policy in this baseline.

  3. 3.

    Reward modeling followed by standard offline RL algorithms: Another simple yet efficient two-step learning approach for offline PBRL problems is combining reward modeling with a standard reward-based offline RL algorithm. First, a reward model is learned from the preference dataset. Then, the reward model is used to provide scalar reward labels to the state-action in the dataset. Finally, we apply a standard offline RL algorithm IQL Kostrikov et al. (2021) training on the dataset with scalar reward labels.

  4. 4.

    Oracle: The oracle is trained with true rewards instead of preference labels. Here, we apply IQL training on the base D4RL dataset with true reward signals. Note that the information from the reward signals is strictly more than that from the preference signals in this case. The oracle should be considered as an upper bound on the performance of an offline PBRL algorithm.

  5. 5.

    Inverse preference learning (IPL) Hejna & Sadigh (2023): IPL is the state-of-the-art learning algorithm for a related offline PBRL setting that requires a preference dataset and a behavior dataset. To avoid underestimating the performance of IPL, we apply this algorithm to our PBRL setting and allow it to check the behaviors in the full D4RL dataset.

To ensure a fair comparison, all the methods that require reward modeling share the same learned reward model during training.

Training setup: To learn a reward model, we follow a standard supervised learning framework. Specifically, we use a multilayer perceptron (MLP) structure for the neural network to approximate the utility model. The neural network consists of 3333 hidden layers, and each has 64646464 neurons. We use a tanh function as the output activation so that the output is bound between [1,1]11[-1,1][ - 1 , 1 ].

To train a deterministic behavior clone policy, the neural network we use to represent the clone policy has the same structure as that for the utility model. To train a stochastic behavior clone policy, the network outputs the mean and standard deviation of a Gaussian distribution separately. The network is an MLP with 3333 hidden layers each with 64646464 neurons. The last layer is split into two for the two outputs. For the mean output, we use a linear function for the last layer. For the standard deviation, we use a linear function and an exp activation for the last layer.

In the reinforcement learning step, we use either the SAC algorithm or the PPO algorithm, depending on the dataset. The neural network for the actor is the same as the network for training a stochastic behavior clone policy. The neural network for the critic is the same as the network for training the reward model.

5.2 Learning Efficiency Evaluation

Here, we compare the efficiency of PRC against other baseline algorithms on different datasets. To straightforwardly compare the performance of different learning algorithms, we show the performance of the best policy learned by a method during training.

The scores in Table 1 are the standard D4RL score of the learned policies. The results show that the PRC algorithm has high learning efficiency. It is generally more efficient than other baselines and sometimes even competes with the oracle. Our algorithm performs better than other two-step learning baselines initialized from the behavior clone policy. This observation indicates that starting from the behavior clone policy or its neighborhood is not enough to achieve high learning efficiency, and training on the constrained action space is the key to high learning efficiency for PRC.

Dataset Oracle PRC Behavior Clone Naive two-step KL two-step IPL RM
HalfCheetah-Medium 47.3 ±plus-or-minus\pm± 0.2 47 ±plus-or-minus\pm± 0.5 41.7 ±plus-or-minus\pm± 1.0 40.1 ±plus-or-minus\pm± 0.5 41.9 ±plus-or-minus\pm± 0.1 42.7 ±plus-or-minus\pm± 0.1 43.2 ±plus-or-minus\pm± 0.1
HalfCheetah-Medium-Replay 46.1 ±plus-or-minus\pm± 0.1 43.3 ±plus-or-minus\pm± 0.2 32.9 ±plus-or-minus\pm± 9.2 31.9 ±plus-or-minus\pm± 0.8 32.9 ±plus-or-minus\pm± 0.9 34.9 ±plus-or-minus\pm± 3.1 40.1 ±plus-or-minus\pm± 0.7
HalfCheetah-Medium-Expert 92.1 ±plus-or-minus\pm± 0.7 78.4 ±plus-or-minus\pm±2.9 44.5 ±plus-or-minus\pm± 3.6 40.9 ±plus-or-minus\pm± 0.2 40.4 ±plus-or-minus\pm± 0.5 41.7 ±plus-or-minus\pm± 1.0 48.2 ±plus-or-minus\pm± 0.8
Hopper-Medium 76.1 ±plus-or-minus\pm± 1.2 71 ±plus-or-minus\pm± 7.5 51.4 ±plus-or-minus\pm± 3.9 67.5 ±plus-or-minus\pm± 2.4 73.5 ±plus-or-minus\pm± 4.0 72.4 ±plus-or-minus\pm± 7.4 67.2 ±plus-or-minus\pm± 0.3
Hopper-Medium-Replay 76.7 ±plus-or-minus\pm± 5.3 47 ±plus-or-minus\pm± 19.5 40.8 ±plus-or-minus\pm± 8.5 49.4 ±plus-or-minus\pm± 7.4 53.1 ±plus-or-minus\pm± 2.2 39.5 ±plus-or-minus\pm± 17.5 32.2 ±plus-or-minus\pm± 0.4
Hopper-Medium-Expert 113.1 ±plus-or-minus\pm± 0.4 100.2 ±plus-or-minus\pm± 9.5 46.7 ±plus-or-minus\pm± 10.9 67.3 ±plus-or-minus\pm± 7.1 71.4 ±plus-or-minus\pm± 10.9 76.9 ±plus-or-minus\pm± 8.1 97.1 ±plus-or-minus\pm± 5.2
Walker2d-Medium 84.5 ±plus-or-minus\pm± 0.3 84.4 ±plus-or-minus\pm± 0.8 73.5 ±plus-or-minus\pm± 1.7 81 ±plus-or-minus\pm± 0.8 79.5 ±plus-or-minus\pm± 2.1 80.8 ±plus-or-minus\pm± 1.6 81.9 ±plus-or-minus\pm± 2.5
Walker2d-Medium-Replay 83.1 ±plus-or-minus\pm± 2.3 87.9 ±plus-or-minus\pm± 6.1 11.6 ±plus-or-minus\pm± 0.6 49.8 ±plus-or-minus\pm± 7.3 45.4 ±plus-or-minus\pm± 0.1 49.3 ±plus-or-minus\pm± 3.5 71.8 ±plus-or-minus\pm± 6.4
Walker2d-Medium-Expert 111.5 ±plus-or-minus\pm± 0.4 110.4 ±plus-or-minus\pm± 0.6 95.9 ±plus-or-minus\pm± 3.1 93.4 ±plus-or-minus\pm± 4.9 92.1 ±plus-or-minus\pm± 2.3 107.5 ±plus-or-minus\pm± 3.0 105.7 ±plus-or-minus\pm± 6.9
Sum Totals 730.5 669.6 439 521.3 530.2 545.7 587.4
Table 1: Comparison between the performance of different learning methods.

Next, we empirically analyze why PRC should be an efficient learning algorithm from the aspect of pessimistic learning and reduced reinforcement learning complexity.

5.3 Pessimism Effectiveness

Here, we empirically verify that by training on a constrained action space, the PRC algorithm achieves effective pessimistic learning.

In Figure 1, we show some representative examples. For the PRC algorithm, during training, the performance trend of the learned policies measured on the learned reward model is aligned with that measured on the true reward. This suggests that the pessimistic learning in the PRC algorithm is effective: it mainly considers the policies that are supported by the dataset so that the agent can evaluate their relative performance accurately. In contrast, for naive two-step learning and KL-regularized two-step learning, we find multiple cases where the trends can even be opposite when evaluated on the reward model and on the true reward. These results suggest pessimistic learning is not efficient enough in the baseline methods compared to PRC.

Refer to caption
Figure 1: Comparison between the trend of the performance of the learned policies on the learned (simulated performance) and true reward models (true performance) during training. An algorithm is not pessimistic enough if the two trends are not aligned.

5.4 Reinforcement Learning Efficiency on Constrained Action Space

Here, we empirically verify that reinforcement learning is much easier on the constrained action space. In this case, we focus on the performance of the learned policies evaluated on the learned reward model.

We observe that in most cases, the performance of the learned policies in the PRC method is much higher than in the baseline two-step learning methods. In Figure 2, we show some representative examples. It is clear that while the baseline method struggles with low-performing policies, the PRC method learns policies of much higher performance.

Refer to caption
Figure 2: Comparison between the performance of the learned policies on the learned reward models during training. The reinforcement learning complexity is less in a setting if the simulated performance is high.

5.5 Reinforcement Learning Complexity in PBRL

Here, we empirically show that training on a reward model learned from preference signals is harder than that from true reward signals. Given the same D4RL dataset, we train two reward models based on the true reward signals and synthetic preference signals. Then, we train an efficient RL algorithm on the two rewards and compare the learning efficiency in the two cases.

The results in Figure 3 show that when learning on the reward model trained from true rewards, the RL agent can quickly learn some policies that have high performance on the reward model. In comparison, it can take much longer for the agent to find a good policy on the reward model from preference signals, and sometimes, the agent may not be able to find a decent policy after a lot of training epochs. This indicates that it is harder to learn from a reward model that is trained on preference signals instead of true reward signals.

Refer to caption
Figure 3: For each dataset, a pair of reward models are trained on the true rewards and preference signals. The same RL algorithm is applied to learn on both reward models. The learning difficulty on a reward model is less if the PPO algorithm can learn better policies according to the reward model.

6 Conclusion and Limitation

In this work, we propose a novel two-step learning algorithm PRC for the offline PBRL setting. We empirically show that the PRC has high learning efficiency and provide evidence for why it is a more efficient two-step learning algorithm than others. Our framework is limited to the typical offline learning setting and does not answer the question of which trajectories are more worthy of receiving preference labels. Our experimental evaluation is limited to continuous control problems, the standard benchmark in RL studies.

7 Reproducibility

In the main paper, we explain the setting of the problem we study. The codes we use for the experiments can be found in the supplementary materials.

References

  • Arora & Doshi (2021) Saurabh Arora and Prashant Doshi. A survey of inverse reinforcement learning: Challenges, methods and progress. Artificial Intelligence, 297:103500, 2021.
  • Bai et al. (2022) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  • Bradley & Terry (1952) Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
  • Cheng et al. (2022) Ching-An Cheng, Tengyang Xie, Nan Jiang, and Alekh Agarwal. Adversarially trained actor critic for offline reinforcement learning. In International Conference on Machine Learning, pp. 3852–3878. PMLR, 2022.
  • Christiano et al. (2017) Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
  • Dulac-Arnold et al. (2021) Gabriel Dulac-Arnold, Nir Levine, Daniel J Mankowitz, Jerry Li, Cosmin Paduraru, Sven Gowal, and Todd Hester. Challenges of real-world reinforcement learning: definitions, benchmarks and analysis. Machine Learning, 110(9):2419–2468, 2021.
  • Fu et al. (2020) Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
  • Fürnkranz et al. (2012) Johannes Fürnkranz, Eyke Hüllermeier, Weiwei Cheng, and Sang-Hyeun Park. Preference-based reinforcement learning: a formal framework and a policy iteration algorithm. Machine learning, 89:123–156, 2012.
  • Haydari & Yılmaz (2020) Ammar Haydari and Yasin Yılmaz. Deep reinforcement learning for intelligent transportation systems: A survey. IEEE Transactions on Intelligent Transportation Systems, 23(1):11–32, 2020.
  • Hejna & Sadigh (2023) Joey Hejna and Dorsa Sadigh. Inverse preference learning: Preference-based rl without a reward function. arXiv preprint arXiv:2305.15363, 2023.
  • Ibarz et al. (2018) Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. Reward learning from human preferences and demonstrations in atari. Advances in neural information processing systems, 31, 2018.
  • Kim et al. (2023) Changyeon Kim, Jongjin Park, Jinwoo Shin, Honglak Lee, Pieter Abbeel, and Kimin Lee. Preference transformer: Modeling human preferences using transformers for rl. arXiv preprint arXiv:2303.00957, 2023.
  • Kostrikov et al. (2021) Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169, 2021.
  • Lee et al. (2021) Kimin Lee, Laura Smith, and Pieter Abbeel. Pebble: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training. arXiv preprint arXiv:2106.05091, 2021.
  • Levine et al. (2020) Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Liu et al. (2023) Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J Liu, and Jialu Liu. Statistical rejection sampling improves preference optimization. arXiv preprint arXiv:2309.06657, 2023.
  • Luong et al. (2019) Nguyen Cong Luong, Dinh Thai Hoang, Shimin Gong, Dusit Niyato, Ping Wang, Ying-Chang Liang, and Dong In Kim. Applications of deep reinforcement learning in communications and networking: A survey. IEEE Communications Surveys & Tutorials, 21(4):3133–3174, 2019.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
  • Sutton & Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Van Hasselt et al. (2018) Hado Van Hasselt, Yotam Doron, Florian Strub, Matteo Hessel, Nicolas Sonnerat, and Joseph Modayil. Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648, 2018.
  • Wirth & Fürnkranz (2013) Christian Wirth and Johannes Fürnkranz. Preference-based reinforcement learning: A preliminary survey. In Proceedings of the ECML/PKDD-13 Workshop on Reinforcement Learning from Generalized Feedback: Beyond Numeric Rewards. Citeseer, 2013.
  • Wirth et al. (2017) Christian Wirth, Riad Akrour, Gerhard Neumann, Johannes Fürnkranz, et al. A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research, 18(136):1–46, 2017.
  • Yu et al. (2021) Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. Combo: Conservative offline model-based policy optimization. Advances in neural information processing systems, 34:28954–28967, 2021.
  • Zhan et al. (2023) Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D Lee, and Wen Sun. Provable offline reinforcement learning with human feedback. arXiv preprint arXiv:2305.14816, 2023.
  • Zhu et al. (2023) Banghua Zhu, Jiantao Jiao, and Michael I Jordan. Principled reinforcement learning with human feedback from pairwise or k𝑘kitalic_k-wise comparisons. arXiv preprint arXiv:2301.11270, 2023.