DRL for optimal execution of portfolio transactions
Based on state S, we take an action A = F(S); after taking the action, we receive a reward R = G(A). Based on this, we want to maximize the future (cumulative) reward. (E.g., in the Atari brick-breaking game Breakout, after about 120 minutes of training the agent learns to break through the bricks on the left.)
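A minimal sketch of this loop, assuming a hypothetical gym-style `env` with a `step()` that returns `(state, reward, done)` and a `policy` callable (none of these names come from the notes):

```python
# Minimal sketch of the state -> action -> reward loop and the quantity we maximize.
def run_episode(env, policy, gamma=0.99):
    """Roll out one episode and return the discounted sum of future rewards."""
    state = env.reset()
    total_return, discount, done = 0.0, 1.0, False
    while not done:
        action = policy(state)                   # A = F(S)
        state, reward, done = env.step(action)   # R = G(A)
        total_return += discount * reward        # accumulate discounted future reward
        discount *= gamma
    return total_return
```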
Deep Q-learning uses a deep neural network to estimate which action (from a finite set) to take given the current state (e.g., an RGB input image). Abstractly, it then resembles a classification task: classify which action to take, with the reward playing the role of the label. 【To approximate the Q-value】
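A minimal PyTorch sketch of this idea, assuming a small state vector rather than RGB frames (layer sizes and names are illustrative assumptions):

```python
# Sketch of a Q-network: the network maps a state to one Q-value per discrete action;
# the bootstrapped TD target plays the role of the "label" in the classification analogy.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one Q-value per discrete action
        )

    def forward(self, state):
        return self.net(state)

def td_target(q_net, reward, next_state, done, gamma=0.99):
    """Target for regression: r + gamma * max_a' Q(s', a'), zero bootstrap if episode ended."""
    with torch.no_grad():
        next_q = q_net(next_state).max(dim=-1).values
    return reward + gamma * (1.0 - done) * next_q
```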
In their basic form, policy gradient and Q-learning are limited to a finite number of discrete actions
Policy gradient
Actor-critic 【good for huge / continuous action spaces, e.g., millions of actions】: state of the art (converges faster than deep Q-learning)
The actor predicts the action (continuous output, effectively infinite actions)
The critic takes the state and action as input and predicts the expected reward (value); see the sketch below
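A minimal PyTorch sketch of the two networks for a continuous action space (the tanh squashing to [-1, 1] and the layer sizes are assumptions, not from the notes):

```python
# Sketch of actor and critic networks for a continuous action space.
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a state to a continuous action in [-1, 1]."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Takes (state, action) as input and predicts the expected return Q(s, a)."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```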
DeepMind A3C: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2; GA3C: GPU-accelerated A3C
Use the advantage A(s, a) = Q(s, a) - V(s), i.e., how much better an action is than the state's baseline value (it acts like a residual over that baseline)
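A common single-step estimate replaces Q(s, a) with the bootstrapped return, so the TD error serves as the advantage; a minimal sketch (function and argument names are assumptions):

```python
# Sketch: one-step advantage estimate A(s, a) ~= r + gamma * V(s') - V(s),
# i.e. the bootstrapped return minus the state-value baseline.
def one_step_advantage(reward, value_s, value_next, done, gamma=0.99):
    bootstrapped_return = reward + gamma * (0.0 if done else value_next)
    return bootstrapped_return - value_s
```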
Optimal execution of portfolio transactions is a well-known problem in quantitative finance: sell X shares before time T while minimizing the cost of trading / market impact / implementation shortfall, etc.
Traditional method for portfolio transactions: the Almgren-Chriss model
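Under linear temporary impact and a mean-variance objective, Almgren-Chriss has a closed-form liquidation trajectory x(t) = X·sinh(κ(T−t))/sinh(κT) with urgency κ = sqrt(λσ²/η); a sketch with illustrative parameter values (the numbers are assumptions, not from the notes):

```python
# Sketch of the continuous-time Almgren-Chriss optimal liquidation trajectory.
import numpy as np

def almgren_chriss_holdings(X, T, n_steps, sigma, eta, lam):
    """Shares still held at each time step: x(t) = X * sinh(kappa*(T-t)) / sinh(kappa*T)."""
    kappa = np.sqrt(lam * sigma**2 / eta)   # higher risk aversion -> faster selling
    t = np.linspace(0.0, T, n_steps + 1)
    return X * np.sinh(kappa * (T - t)) / np.sinh(kappa * T)

# Example: liquidate 1e6 shares over 60 periods (illustrative parameters only).
holdings = almgren_chriss_holdings(X=1e6, T=60, n_steps=60, sigma=0.02, eta=2.5e-6, lam=1e-6)
trades = -np.diff(holdings)                 # shares sold in each period
```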
Applying RL to the Almgren-Chriss setting (define the reward function first)
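One possible choice, used here only as an illustration of what "define the reward function" might mean (the notes do not specify one), is the negative per-step implementation shortfall:

```python
# Sketch of a hypothetical per-step reward for an execution agent:
# the negative implementation shortfall of the shares sold in this step,
# i.e. the cost of executing below the arrival price.
def step_reward(shares_sold, exec_price, arrival_price):
    shortfall = shares_sold * (arrival_price - exec_price)
    return -shortfall
```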