Learning Successor Affordances as Temporal Abstractions

Successor features (SF) provide a convenient representation of value functions: the value function under a new reward function can be obtained by simply recombining the features linearly. However, successor features, by construction, require the underlying policy of the value function to be fixed. This is undesirable when the goal is to find the optimal value function for each of several reward functions, since the successor features of different policies can differ.
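As a minimal sketch of this recombination property (with made-up shapes and random values, purely for illustration): once the successor features of a fixed policy are known, its value under any new reward weight vector is a single matrix-vector product.

```python
import numpy as np

# Hypothetical sketch: with fixed-policy successor features psi_pi and a
# linear reward r(s, a) = phi(s, a)^T w, the value of the SAME policy
# under a new reward vector w_new is a linear readout -- no re-learning.
rng = np.random.default_rng(0)
n_sa, d = 6, 4                       # number of (s, a) pairs, feature dim
psi_pi = rng.normal(size=(n_sa, d))  # pretend these were learned for pi
w_new = rng.normal(size=d)           # weights defining a new task's reward
q_pi_new = psi_pi @ w_new            # Q^pi under the new reward, for free
```

Note, however, that `q_pi_new` evaluates the old policy pi on the new task; it is not the optimal value function for that task, which is the gap successor affordances aim to close.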

In this project, we explore successor affordances (SA), which provide a basis for optimal value functions across a variety of reward functions. SA resemble SF in that they can be linearly combined to obtain value functions for new reward functions, but they target optimal value functions rather than the value functions of a fixed policy. Intuitively, SA could be useful for determining the optimal policy for a range of tasks (each defined by a reward function). As the name suggests, SA contain information about which tasks can be achieved, and the optimal policies for these tasks can be extracted easily.

\begin{align}
    r(\bs, \ba, \bg) &= \phi(\bs, \ba)^\top \xi(\bg) \\
    Q^\star(\bs, \ba, \bg) &= \psi(\bs, \ba)^\top \xi(\bg) \\
    \psi(\bs, \ba)^\top \xi(\bg) &= \phi(\bs, \ba)^\top \xi(\bg) + \mathbb{E}_{\bs'}\left[\max_{\ba'} \psi(\bs', \ba')^\top \xi(\bg) \right]
\end{align}
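The fixed point above can be computed by simple iteration in the tabular case. The sketch below uses a toy deterministic 5-state chain (a hypothetical setup, not from the source), with one-hot successor-state features for phi; note that the equation as written omits a discount factor, so we assume a gamma < 1 here to make the iteration converge. As a sanity check, the resulting linear readout psi^T xi is compared against plain tabular Q-value iteration on the reward r = phi^T xi.

```python
import numpy as np

N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9  # GAMMA is an added assumption

def step(s, a):
    """Deterministic transition: action 0 moves left, 1 moves right."""
    return max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)

# phi(s, a): one-hot encoding of the successor state, so psi(s, a)^T xi
# accumulates (discounted) task rewards along the greedy trajectory.
phi = np.zeros((N_STATES, N_ACTIONS, N_STATES))
for s in range(N_STATES):
    for a in range(N_ACTIONS):
        phi[s, a, step(s, a)] = 1.0

def successor_affordances(xi, n_iters=200):
    """Fixed-point iteration for psi, with the max taken w.r.t. xi."""
    psi = np.zeros_like(phi)
    for _ in range(n_iters):
        new_psi = np.empty_like(psi)
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                s_next = step(s, a)
                a_star = np.argmax(psi[s_next] @ xi)  # greedy action at s'
                new_psi[s, a] = phi[s, a] + GAMMA * psi[s_next, a_star]
        psi = new_psi
    return psi

# Example task: reward for landing in the rightmost state.
xi = np.zeros(N_STATES)
xi[-1] = 1.0
psi = successor_affordances(xi)
q_star = psi @ xi  # optimal Q-values recovered by a linear readout

# Sanity check against tabular Q-value iteration on r = phi^T xi.
q_vi = np.zeros((N_STATES, N_ACTIONS))
for _ in range(200):
    q_vi = phi @ xi + GAMma * np.array(
        [[q_vi[step(s, a)].max() for a in range(N_ACTIONS)]
         for s in range(N_STATES)]) if False else phi @ xi + GAMMA * np.array(
        [[q_vi[step(s, a)].max() for a in range(N_ACTIONS)]
         for s in range(N_STATES)])
assert np.allclose(q_star, q_vi)
```

In this tabular setting the iteration coincides with value iteration, which is why the check passes; the interesting case, and the aim of the project, is learning psi with function approximation so that one set of features serves many task vectors xi.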