# Off-Policy Estimation using Reinforcement Learning

In conventional reinforcement learning (RL) settings, an agent interacts with an environment in an online fashion, meaning that it collects data from its interaction with the environment, and that data is then used to inform changes to the policy governing its behavior. In contrast, offline RL refers to the setting where historical data are used either to learn good policies for acting in an environment or to evaluate the performance of new policies. As RL is increasingly applied to crucial real-life problems like robotics and recommendation systems, evaluating new policies in the offline setting (estimating the expected reward of a target policy given historical data generated from actions chosen by a behavior policy) becomes more critical. However, despite its importance, evaluating the overall effectiveness of a target policy from data generated by historical behavior policies can be tricky, because high-fidelity simulators are difficult to build and because the data distributions are mismatched.

**Example of Reinforcement Learning**

As a simple example, consider the game Pong: one might like to predict whether a new strategy (the target policy) increases the chance of winning, using only historical data collected from previous strategies (behavior policies) and without actually playing the game. If one were interested only in the performance of the behavior policy, a good metric would be the average of the rewards over all time steps in the historical data. However, since the historical data reflects actions chosen by the behavior policy and not the target policy, this simple average of rewards in the off-policy data would not yield a good estimate of the target policy’s long-term reward. Instead, a proper correction must be made to remove the bias that results from having two different policies (i.e., the difference in data distribution).

In “Black-Box Off-Policy Estimation for Infinite-Horizon Reinforcement Learning”, accepted at ICLR 2020, we propose a new approach to evaluating a given policy from offline data, based on estimating the expected reward of the target policy as a weighted average of rewards in the off-policy data. Since meaningful weights for the off-policy data are not known a priori, we propose a novel way of learning them. Unlike most previous work, our method is particularly suitable when the historical data contains trajectories that are very long or have infinite horizons. We empirically demonstrate the effectiveness of this approach using a number of classical control benchmarks.

**Background**

In general, one approach to solving the off-policy evaluation problem is to build a simulator that mimics the interaction of the agent with the environment, and then evaluate the target policy against the simulation. While the idea is natural, building a high-fidelity simulator can be extremely challenging for many domains, particularly those that involve human interactions. An alternative approach is to use a weighted average of rewards from the off-policy data as an estimate of the average reward of the target policy. This approach can be more robust than using a simulator because it does not require modeling assumptions about real-world dynamics. Indeed, most previous efforts using this approach have found success on short-horizon problems, where the number of time steps (i.e., the length of a data trajectory) is limited. However, as the horizon is extended, the variance in predictions made by most previous estimators often grows exponentially, necessitating novel solutions for long-horizon problems, and even more so in the extreme case of the infinite-horizon problem.
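To get a feel for why the variance can blow up with the horizon, here is a minimal sketch. It assumes an estimator that weights a whole trajectory by the product of per-step probability ratios (target over behavior), and it assumes a hypothetical single-state problem with two actions so that the per-step ratios are independent. The probabilities are made up for illustration and do not come from the paper.

```python
# Hypothetical single-state problem with two actions; the probabilities are
# illustrative assumptions, not values from the paper.
behavior = [0.8, 0.2]   # behavior policy: P(action 0), P(action 1)
target = [0.2, 0.8]     # target policy:   P(action 0), P(action 1)

# Per-step importance ratio for each action: target / behavior.
ratios = [t / b for t, b in zip(target, behavior)]

# Second moment of a single ratio when actions are drawn from the behavior policy.
second_moment = sum(b * r * r for b, r in zip(behavior, ratios))


def trajectory_weight_variance(horizon):
    """Variance of the product of `horizon` independent per-step ratios.

    Each ratio has mean 1 under the behavior policy, so the product has
    mean 1 and second moment second_moment ** horizon.
    """
    return second_moment ** horizon - 1.0


for horizon in (1, 5, 10, 20):
    print(horizon, trajectory_weight_variance(horizon))
```

Under these assumptions the variance of the trajectory weight is exactly geometric in the horizon, which is the kind of growth that makes long or infinite horizons hard for product-of-ratios estimators.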

**Our Approach for Infinite-Horizon Reinforcement Learning**

Our method for off-policy evaluation (OPE) leverages a well-known statistical technique called importance sampling, through which one can estimate properties of a particular distribution (e.g., the mean) from samples generated by another distribution. In particular, we estimate the long-term average reward of the target policy using a weighted average of rewards from the behavior policy data. The difficulty in this approach is how to choose the weights so as to remove the bias between the off-policy data distribution and that of the target policy while achieving the best estimate of the target policy’s average reward. One important point is that if the weights are normalized to be positive and sum to one, then they define a probability distribution over the set of possible states and actions of the agent.
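The idea of a weighted average of rewards can be illustrated with a small sketch. It assumes a hypothetical single-state problem in which the correct weights are simply known (the ratio of target to behavior action probabilities); in our actual setting the weights are not known and have to be learned. All names and numbers below are illustrative.

```python
import random

# Hypothetical single-state problem: action 1 pays reward 1, action 0 pays 0.
# All probabilities are illustrative assumptions.
behavior = [0.8, 0.2]   # behavior policy: P(action 0), P(action 1)
target = [0.2, 0.8]     # target policy:   P(action 0), P(action 1)
reward = [0.0, 1.0]     # reward for each action

# Historical data: actions sampled from the behavior policy only.
rng = random.Random(0)
actions = [0 if rng.random() < behavior[0] else 1 for _ in range(20_000)]
rewards = [reward[a] for a in actions]

# Importance weight of each sample: target probability / behavior probability.
weights = [target[a] / behavior[a] for a in actions]

# Naive estimate: plain average of the observed rewards.
naive_estimate = sum(rewards) / len(rewards)

# Weighted estimate: dividing by the sum of weights normalizes them to sum to one.
weighted_estimate = sum(w * r for w, r in zip(weights, rewards)) / sum(weights)

# Ground truth for comparison, available here only because the problem is a toy.
true_target_value = sum(p * r for p, r in zip(target, reward))

print(naive_estimate, weighted_estimate, true_target_value)
```

In this toy case the naive average tracks the behavior policy’s reward, while the normalized weighted average lands near the target policy’s reward even though no data was collected under the target policy.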

On the other hand, an individual policy defines a distribution over how often an agent visits a particular state or performs a particular action. In other words, it defines a unique distribution over states and actions. Under reasonable assumptions, this distribution does not change over time and is called a stationary distribution. Since we are using importance sampling, we naturally want to optimize the weights of the estimator such that the distribution induced by those weights matches the stationary distribution of the target policy. However, the problem remains that we do not know the stationary distribution of the target policy, since we have no data generated by that policy. One way to overcome this problem is to make sure that the distribution of weights satisfies properties that the target policy distribution has, without actually knowing what this distribution is. Luckily, we can take advantage of some mathematical “trickery” to solve this. While the full details are found in our paper, the upshot is that although we do not know the stationary distribution of the target policy (since we have no data collected from it), we can determine that distribution by solving an optimization problem involving a backward operator. This operator describes how an agent transitions from other states and actions to a particular state and action; it takes a probability distribution as input and returns one as output, and the stationary distribution is the one it leaves unchanged. Once this is done, the weighted average of rewards from the historical data gives us an estimate of the expected reward of the target policy.
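The role of the stationary distribution can be sketched on a tiny tabular problem. This is not our black-box method: the sketch assumes the transition probabilities are known and finds each stationary distribution by simple power iteration, whereas the paper learns the weights from data alone. The 3-state, 2-action MDP, its rewards, and both policies are hypothetical.

```python
import random

# Hypothetical 3-state, 2-action MDP; every number is an illustrative assumption.
# P[s][a][s2]: probability of landing in state s2 after taking action a in state s.
P = [
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],
    [[0.5, 0.4, 0.1], [0.0, 0.3, 0.7]],
    [[0.6, 0.2, 0.2], [0.1, 0.1, 0.8]],
]
R = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]   # R[s][a]: reward for action a in state s
behavior = [[0.7, 0.3]] * 3                # behavior policy: mostly action 0
target = [[0.2, 0.8]] * 3                  # target policy: mostly action 1
N_S, N_A = 3, 2


def stationary(policy, iters=2000):
    """Stationary state-action distribution of `policy`, via power iteration."""
    d = [1.0 / N_S] * N_S
    for _ in range(iters):
        # Push the state distribution one step forward under the policy.
        d = [sum(d[s] * policy[s][a] * P[s][a][s2]
                 for s in range(N_S) for a in range(N_A))
             for s2 in range(N_S)]
    return [[d[s] * policy[s][a] for a in range(N_A)] for s in range(N_S)]


def average_reward(d_sa):
    """Long-term average reward under a stationary state-action distribution."""
    return sum(d_sa[s][a] * R[s][a] for s in range(N_S) for a in range(N_A))


d_behavior = stationary(behavior)
d_target = stationary(target)
# Ideal weights: ratio of the target and behavior stationary distributions.
w = [[d_target[s][a] / d_behavior[s][a] for a in range(N_A)] for s in range(N_S)]

# Collect one long trajectory using the behavior policy only.
rng = random.Random(0)
s, num, den, total = 0, 0.0, 0.0, 0.0
steps = 50_000
for _ in range(steps):
    a = 0 if rng.random() < behavior[s][0] else 1
    num += w[s][a] * R[s][a]
    den += w[s][a]
    total += R[s][a]
    s = rng.choices(range(N_S), weights=P[s][a])[0]

weighted_estimate = num / den          # weighted average of behavior rewards
naive_estimate = total / steps         # plain average of behavior rewards
true_target_value = average_reward(d_target)
print(naive_estimate, weighted_estimate, true_target_value)
```

With weights equal to the ratio of the two stationary distributions, the weighted average of behavior-policy rewards approaches the target policy’s average reward, while the plain average approaches the behavior policy’s. The hard part, which this sketch sidesteps by using the known transition model, is finding such weights without ever observing the target policy.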

**Experimental Results**

Using a toy environment called ModelWin, which has three states and two actions, we compare our work with a previous state-of-the-art approach (labeled “IPS”), along with a naive method in which we simply average the rewards from the behavior policy data. The figure below shows the log of the root-mean-square error (RMSE) with respect to the target policy reward as we change the number of steps collected by the behavior policy. The naive method suffers from a large bias, and its error does not change even when more data is collected by increasing the length of the episode. The estimation error of the IPS method decreases with increasing horizon length. In contrast, the error exhibited by our method is small, even for short horizon lengths.

We also compare the performance of our approach with other approaches (including the naive estimator, IPS, and a model-based estimator) on several classic control problems. As we can see in the figures below, the performance of naive averaging is nearly independent of the number of trajectories. Our method outperforms the other approaches in three sample environments: CartPole, Pendulum, and MountainCar.

To summarize, in this post we described how one can use historical data gathered according to a behavior policy to assess the quality of a new target policy. An interesting future direction for this work is to use structural domain knowledge to improve the algorithm. We invite interested readers to read our paper to learn more about this work.