Automated State and Action Space Design for Multi-Objective Reinforcement Learning


While Reinforcement Learning (RL) has shown impressive performance in games (e.g., Go and chess [14,13], DoTA2 [2], Starcraft II [18,17], etc.), it remains a challenging problem to use RL in real-world scenarios due to the gigantic sample complexity and limited ability to generalize to unknown environments. To train
an RL agent with traditional online methods, millions of interactions with the environments are often needed, which is definitely infeasible for real-world applications due to various limitations (e.g., imperfect simulators, expensive simulation, license issues, etc). Furthermore, the learned agent can often only be used in the same environment, lacking the ability to generalize [4].


Motivation and Related Work

Unlike many game or robotics tasks that have a ``natural'' MDP formulation (e.g., board situation as the state and candidates moves as the action), real-world optimization problems can have infinitely many MDP formulations. This opens many research problems that are largely ignored in the current RL community:

  1. Design of state spaces: Several previous works studying combinatorial optimization using RL operate in fixed and exponentially large state spaces. Solution concepts are often search based, such as predicting the full solution in one shot [19, 1, 10], predicting parts of the solution sequentially [24, 5, 8], or starting with a feasible solution and progressively refining it [3]. This motivates the question of automatically learning state
    spaces, which can significantly narrow down the effective search space.
  2. Design of action spaces: Likewise, the design of the action space also matters: making one critical decision often is much more important than making many irrelevant decisions. In contrast, current RL environments like games are less affected by this issue, since their actions are, by human design, informative and often critical to the outcome of the game.
  3. In this research direction, instead of designing the state/action space manually, we aim to develop novel techniques to achieve automatic design. The approaches in these papers build upon a framework for black-box function optimization and are based on exploration, space partitioning and sampling subroutines. We have existing works (e.g., LaNAS [21], LaMCTS [20] and LaP3[22]) that dynamically learns the action space, based on explored samples collected on the black-box (value) function being optimized.

Technical Objective

The problem of designing automated and data dependent state and action spaces has largely been studied from the perspective of single-objective reinforcement learning. The extension to multi-objective settings is possible by reduction to multi-objective optimization [23].

Multi-objective optimization: n real-world optimization problems often multiple criteria must be considered simultaneously, leading to multi-objective optimization (MOO) [11]. In MOO, we look for a Pareto Frontier (PF) in which solutions are not dominated by any other feasible solutions (i.e. any improvements in one criteria must come at the cost of another). In particular, in multi-objective RL, optimal actions lie on the PF of the pareto optimal value function for the problem [16, 9]. Leveraging learning-based techniques for MOO remains a challenging problem.While there are some initial results in this space for optimizing
black-box functions [23], a lot of improvement is still possible.

Data-dependent state and action spaces for the problem can help narrow the search space in RL, leading to faster training of agents. Likewise, similar capacity models are expected to learn better representations across these smaller state and action spaces, which can help improve generalization performance.

Improving generalization in RL: How to ensure that the learned agent can adapt in new but highly related tasks remains a key challenge in RL [7]. Through the lens of representation learning, generalization in RL is intimately connected to designing good state and action spaces for the problem. MOO can provide an alternate paradigm to action abstraction [12] by allowing training RL agents over data-dependent action spaces. This is especially important in multi-agent settings, where the joint action space grows rapidly as a function of the number of agents.

These questions motivate us to design algorithms for MOO which are,

  • Implementable for multi-objective problems, such as avoiding exact computation of hypervolumes [6].
  • Achieve state of the the art performance in practice.
  • Provable theoretical guarantees, such as asymptotic convergence to the PF for simple function classes.

In general, the PF may not be convex, or even connected, a phenomenon likely to be the case in reinforcement learning tasks. The challenge is further exacerbated in discrete hyperparameter tuning/architecture search where it is often time-consuming or even infeasible to obtain gradients of the objectives. In order to design algorithms for real-world settings, we plan to use the quadratic setting (individual objectives are quadratic functions) as an initial theoretical testbed for designing gradient-based and gradient-free algorithms.We intend to build upon the meta-algorithm of [23] in two ways:

  • Recursively partitioning the multi-objective into subproblems involving only a few objectives at once and combining the resulting PFs.In the quadratic case, this approach can provably be used to hierarchically reconstruct the Pareto frontier.
  • Merging and pruning nodes in the tree data structure of [23] during training to improve runtime, and prevent oversampling in less promising regions of the space.If a large region of space generated by combining many smaller regions still has poor dominance number, then there is little reason to continue sampling from such regions.