Showing 1–15 of 15 results for author: Skalse, J

  1. arXiv:2406.15753  [pdf, other]

    cs.LG cs.AI stat.ML

    The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret

    Authors: Lukas Fluri, Leon Lang, Alessandro Abate, Patrick Forré, David Krueger, Joar Skalse

    Abstract: In reinforcement learning, specifying reward functions that capture the intended task can be very challenging. Reward learning aims to address this issue by learning the reward function. However, a learned reward model may have a low error on the training distribution, and yet subsequently produce a policy with large regret. We say that such a reward model has an error-regret mismatch. The main so…

    Submitted 22 June, 2024; originally announced June 2024.

    Comments: 58 pages, 1 figure

  2. arXiv:2405.06624  [pdf, other]

    cs.AI

    Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

    Authors: David "davidad" Dalrymple, Joar Skalse, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann, Alessandro Abate, Joe Halpern, Clark Barrett, Ding Zhao, Tan Zhi-Xuan, Jeannette Wing, Joshua Tenenbaum

    Abstract: Ensuring that AI systems reliably and robustly avoid harmful or dangerous behaviours is a crucial challenge, especially for AI systems with a high degree of autonomy and general intelligence, or systems used in safety-critical contexts. In this paper, we will introduce and define a family of approaches to AI safety, which we will refer to as guaranteed safe (GS) AI. The core feature of these appro…

    Submitted 8 July, 2024; v1 submitted 10 May, 2024; originally announced May 2024.

  3. arXiv:2403.06854  [pdf, other]

    cs.LG

    Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification

    Authors: Joar Skalse, Alessandro Abate

    Abstract: Inverse reinforcement learning (IRL) aims to infer an agent's preferences (represented as a reward function $R$) from their behaviour (represented as a policy $π$). To do this, we need a behavioural model of how $π$ relates to $R$. In the current literature, the most common behavioural models are optimality, Boltzmann-rationality, and causal entropy maximisation. However, the true relationship bet…

    Submitted 11 March, 2024; originally announced March 2024.

  4. arXiv:2401.14811  [pdf, ps, other]

    cs.AI cs.LG

    On the Limitations of Markovian Rewards to Express Multi-Objective, Risk-Sensitive, and Modal Tasks

    Authors: Joar Skalse, Alessandro Abate

    Abstract: In this paper, we study the expressivity of scalar, Markovian reward functions in Reinforcement Learning (RL), and identify several limitations to what they can express. Specifically, we look at three classes of RL tasks: multi-objective RL, risk-sensitive RL, and modal RL. For each class, we derive necessary and sufficient conditions that describe when a problem in this class can be expressed usi…

    Submitted 26 January, 2024; originally announced January 2024.

    Journal ref: Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, PMLR 216:1974-1984, 2023

  5. arXiv:2310.11840  [pdf, other]

    cs.LG

    On The Expressivity of Objective-Specification Formalisms in Reinforcement Learning

    Authors: Rohan Subramani, Marcus Williams, Max Heitmann, Halfdan Holm, Charlie Griffin, Joar Skalse

    Abstract: Most algorithms in reinforcement learning (RL) require that the objective is formalised with a Markovian reward function. However, it is well-known that certain tasks cannot be expressed by means of an objective in the Markov rewards formalism, motivating the study of alternative objective-specification formalisms in RL such as Linear Temporal Logic and Multi-Objective Reinforcement Learning. To d…

    Submitted 17 February, 2024; v1 submitted 18 October, 2023; originally announced October 2023.

    Comments: Published as a conference paper at ICLR 2024

  6. arXiv:2310.09144  [pdf, other]

    cs.LG

    Goodhart's Law in Reinforcement Learning

    Authors: Jacek Karwowski, Oliver Hayman, Xingjian Bai, Klaus Kiendlhofer, Charlie Griffin, Joar Skalse

    Abstract: Implementing a reward function that perfectly captures a complex task in the real world is impractical. As a result, it is often appropriate to think of the reward function as a proxy for the true objective rather than as its definition. We study this phenomenon through the lens of Goodhart's law, which predicts that increasing optimisation of an imperfect proxy beyond some critical point decrease…

    Submitted 13 October, 2023; originally announced October 2023.

  7. arXiv:2309.15257  [pdf, other]

    cs.LG cs.AI

    STARC: A General Framework For Quantifying Differences Between Reward Functions

    Authors: Joar Skalse, Lucy Farnik, Sumeet Ramesh Motwani, Erik Jenner, Adam Gleave, Alessandro Abate

    Abstract: In order to solve a task using reinforcement learning, it is necessary to first formalise the goal of that task as a reward function. However, for many real-world tasks, it is very difficult to manually specify a reward function that never incentivises undesirable behaviour. As a result, it is increasingly popular to use reward learning algorithms, which attempt to learn a reward fun…

    Submitted 11 March, 2024; v1 submitted 26 September, 2023; originally announced September 2023.

  8. Lexicographic Multi-Objective Reinforcement Learning

    Authors: Joar Skalse, Lewis Hammond, Charlie Griffin, Alessandro Abate

    Abstract: In this work we introduce reinforcement learning techniques for solving lexicographic multi-objective problems. These are problems that involve multiple reward signals, and where the goal is to learn a policy that maximises the first reward signal, and subject to this constraint also maximises the second reward signal, and so on. We present a family of both action-value and policy gradient algorit…

    Submitted 28 December, 2022; originally announced December 2022.

    Journal ref: IJCAI 2022; Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence. Main Track, Pages 3430-3436

  9. arXiv:2212.03201  [pdf, ps, other]

    cs.LG

    Misspecification in Inverse Reinforcement Learning

    Authors: Joar Skalse, Alessandro Abate

    Abstract: The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function $R$ from a policy $π$. To do this, we need a model of how $π$ relates to $R$. In the current literature, the most common models are optimality, Boltzmann rationality, and causal entropy maximisation. One of the primary motivations behind IRL is to infer human preferences from human behaviour. However, the true relationsh…

    Submitted 24 March, 2023; v1 submitted 6 December, 2022; originally announced December 2022.

    Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence, 2023

  10. arXiv:2209.13085  [pdf, other]

    cs.LG stat.ML

    Defining and Characterizing Reward Hacking

    Authors: Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, David Krueger

    Abstract: We provide the first formal definition of reward hacking, a phenomenon where optimizing an imperfect proxy reward function, $\mathcal{\tilde{R}}$, leads to poor performance according to the true reward function, $\mathcal{R}$. We say that a proxy is unhackable if increasing the expected proxy return can never decrease the expected true return. Intuitively, it might be possible to create an unhacka…

    Submitted 26 September, 2022; originally announced September 2022.

  11. arXiv:2203.07475  [pdf, other]

    cs.LG cs.AI stat.ML

    Invariance in Policy Optimisation and Partial Identifiability in Reward Learning

    Authors: Joar Skalse, Matthew Farrugia-Roberts, Stuart Russell, Alessandro Abate, Adam Gleave

    Abstract: It is often very challenging to manually design reward functions for complex, real-world tasks. To solve this, one can instead use reward learning to infer a reward function from data. However, there are often multiple reward functions that fit the data equally well, even in the infinite-data limit. This means that the reward function is only partially identifiable. In this work, we formally chara…

    Submitted 7 June, 2023; v1 submitted 14 March, 2022; originally announced March 2022.

    Comments: ICML 2023. 9 pages main paper, 26 pages total, 3 figures

    ACM Class: I.2.6

  12. arXiv:2101.00280  [pdf, ps, other]

    cs.AI

    A General Counterexample to Any Decision Theory and Some Responses

    Authors: Joar Skalse

    Abstract: In this paper I present an argument and a general schema which can be used to construct a problem case for any decision theory, in a way that could be taken to show that one cannot formulate a decision theory that is never outperformed by any other decision theory. I also present and discuss a number of possible responses to this argument. One of these responses raises the question of what it mean…

    Submitted 1 January, 2021; originally announced January 2021.

    Comments: 4 pages

  13. arXiv:2006.15191  [pdf, other]

    cs.LG stat.ML

    Is SGD a Bayesian sampler? Well, almost

    Authors: Chris Mingard, Guillermo Valle-Pérez, Joar Skalse, Ard A. Louis

    Abstract: Overparameterised deep neural networks (DNNs) are highly expressive and so can, in principle, generate almost any function that fits a training dataset with zero error. The vast majority of these functions will perform poorly on unseen data, and yet in practice DNNs often generalise remarkably well. This success suggests that a trained DNN must have a strong inductive bias towards functions with l…

    Submitted 24 October, 2020; v1 submitted 26 June, 2020; originally announced June 2020.

    Journal ref: Journal of Machine Learning Research, 22 79 (2021), 1-64

  14. arXiv:1909.11522  [pdf, other]

    cs.LG stat.ML

    Neural networks are a priori biased towards Boolean functions with low entropy

    Authors: Chris Mingard, Joar Skalse, Guillermo Valle-Pérez, David Martínez-Rubio, Vladimir Mikulik, Ard A. Louis

    Abstract: Understanding the inductive bias of neural networks is critical to explaining their ability to generalise. Here, for one of the simplest neural networks -- a single-layer perceptron with n input neurons, one output neuron, and no threshold bias term -- we prove that upon random initialisation of weights, the a priori probability $P(t)$ that it represents a Boolean function that classifies t points…

    Submitted 2 January, 2020; v1 submitted 25 September, 2019; originally announced September 2019.

  15. arXiv:1906.01820  [pdf, other]

    cs.AI

    Risks from Learned Optimization in Advanced Machine Learning Systems

    Authors: Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant

    Abstract: We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer - a situation we refer to as mesa-optimization, a neologism we introduce in this paper. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances…

    Submitted 1 December, 2021; v1 submitted 5 June, 2019; originally announced June 2019.