I want to introduce several RL issues, based on my survey and on the challenges encountered in real-world industry applications. I hope this note is helpful to you.

Offline evaluation/learning

The problem is how to evaluate a policy on historical data, and even how to learn from that data.

First, in an e-commerce scenario, we are not allowed to deploy a random policy for serving. We are usually required to start with a policy that is at least as good as the existing one (i.e., the baseline, which is rule-based or learnt by some supervised method). Vanilla imitation learning is an option. MARWIL is a better choice when the historical data include rewards in addition to state-action pairs: it weights the state-action pairs by their estimated advantages and provides a TRPO-style guarantee. I implemented MARWIL for Ray RLlib and validated on CartPole-v0 that the learnt policy achieves a mean_episode_reward of 150, even though the historical data consist of rollouts with a mean_episode_reward of 100.

Second, we cannot continuously push the latest network to our serving components due to some engineering issues, so the learning procedure runs in batch mode. For off-policy algorithms like DQN and DDPG, since Q-learning is fragile and cannot provide any TRPO-style guarantee, we had better conduct counterfactual policy evaluation, which means evaluating the current policy without any interaction with the real environment. The most important advances (1, 2, 3) in this line of work concern doubly robust evaluation. According to a discussion with Yitao Liang, a developer of Facebook’s Horizon, they include these methods to evaluate an updated policy before pushing it to the serving components.

As for on-policy algorithms (mainly, most policy gradient algorithms), since the performance guarantee often implies a tiny update, batch mode is tantamount to extremely slow convergence (though we have to admit that an extremely large batch provides a good MC estimate with small variance). To accelerate the learning of policy gradient algorithms, asynchronous (that is, off-policy) updating should be supported, where the pivotal issue is handling policy lag.
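To make the counterfactual policy evaluation mentioned above for off-policy algorithms more concrete, here is a minimal sketch of one common recursive form of the doubly robust estimator for a single logged trajectory. It is not taken from Horizon or from the cited papers. The function name and the inputs `rho`, `q_hat` and `v_hat` are hypothetical, and I assume the value estimates come from some separately fitted model.

```python
def doubly_robust_value(rewards, rho, q_hat, v_hat, gamma=0.99):
    """Doubly robust value estimate for one logged trajectory (illustrative sketch).

    All argument names are assumptions made for this example:
      rewards[t]: logged reward at step t
      rho[t]:     pi_new(a_t | s_t) / pi_logging(a_t | s_t)
      q_hat[t]:   model estimate of Q(s_t, a_t) under the new policy
      v_hat[t]:   model estimate of V(s_t) under the new policy
    """
    v_dr = 0.0
    # Walk backwards so v_dr always holds the estimate for the remaining steps.
    for t in reversed(range(len(rewards))):
        # Model prediction plus an importance-weighted correction of its error.
        v_dr = v_hat[t] + rho[t] * (rewards[t] + gamma * v_dr - q_hat[t])
    return v_dr


if __name__ == "__main__":
    # With an all-zero model and unit ratios the estimate is the discounted return.
    print(doubly_robust_value([1.0, 1.0, 1.0], [1.0] * 3, [0.0] * 3, [0.0] * 3, gamma=0.5))
```

When the value model is exact, the correction terms vanish and the importance ratios no longer matter. When the model is poor but the ratios are correct, the correction terms remove the model's bias. That is the sense in which the estimator is "doubly" robust.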
In essence, most methods correct for the policy lag via importance sampling (IS). Several NIPS 2018 papers focus on reducing the variance of IS, especially when the episode is very long (the variance increases exponentially w.r.t. the episode length). In my opinion, 4 is an interesting one.
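The exponential growth of the IS variance can be seen in a toy setting. The sketch below is my own illustration, not the method of any cited paper. It assumes state-independent policies, so the per-step ratios are i.i.d., and all function names are hypothetical. It also contrasts ordinary trajectory-level IS with per-decision IS, which weights each reward only by the ratios up to that step.

```python
import math


def trajectory_is_return(rewards, ratios, gamma=1.0):
    """Ordinary IS: weight the whole discounted return by the product of all ratios."""
    weight = math.prod(ratios)
    discounted_return = sum(gamma ** t * r for t, r in enumerate(rewards))
    return weight * discounted_return


def per_decision_is_return(rewards, ratios, gamma=1.0):
    """Per-decision IS: the reward at step t is weighted by ratios 0..t only."""
    total, weight = 0.0, 1.0
    for t, (r, rho) in enumerate(zip(rewards, ratios)):
        weight *= rho
        total += weight * gamma ** t * r
    return total


def is_weight_variance(target_probs, behavior_probs, horizon):
    """Exact variance of the product of `horizon` i.i.d. importance ratios.

    Toy assumption: both policies ignore the state, so each ratio is an
    independent draw with mean 1 and second moment sum_a pi(a)^2 / b(a).
    """
    second_moment = sum(p * p / b for p, b in zip(target_probs, behavior_probs))
    return second_moment ** horizon - 1.0


if __name__ == "__main__":
    # The variance of the trajectory weight grows geometrically with the horizon.
    for horizon in (1, 10, 50):
        print(horizon, is_weight_variance([0.8, 0.2], [0.5, 0.5], horizon))
```

Per-decision IS does not remove the exponential dependence, but early rewards are multiplied by fewer ratios, which is one of the simplest variance reductions available.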