Queueing Network Controls via Deep Reinforcement Learning
by
J. G. Dai, Mark Gluzman
2020
Abstract
Novel advanced policy gradient (APG) methods, such as Trust Region Policy
Optimization (TRPO) and Proximal Policy Optimization (PPO), have become the
dominant reinforcement learning algorithms because of their ease of
implementation and good practical performance. A conventional setup for
notoriously difficult queueing network control problems is a Markov decision
problem (MDP) that has three features: an infinite state space, unbounded
costs, and a long-run average cost objective. We extend the theoretical
framework of these APG methods to such MDP problems. The resulting PPO
algorithm is tested on a parallel-server system and on large-size multiclass
queueing networks. The algorithm consistently generates control policies that
outperform state-of-the-art heuristics in the literature under a variety of
load conditions, from light to heavy traffic. These policies are demonstrated
to be near-optimal when the optimal policy can be computed.
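For reference, in standard MDP notation (ours, not necessarily the paper's),
the long-run average cost objective and the relative value function h referred
to below are

    \eta(\pi) = \limsup_{T \to \infty} \frac{1}{T}\,
        \mathbb{E}_\pi\!\left[\sum_{t=0}^{T-1} c(X_t)\right],
    \qquad
    c(x) - \eta(\pi) + \mathbb{E}_\pi\!\left[h(X_{t+1}) \mid X_t = x\right] - h(x) = 0,

where the second identity is Poisson's equation, which determines h up to an
additive constant.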
A key to the success of our PPO algorithm is the use of three variance
reduction techniques in estimating the relative value function via sampling.
First, we use a discounted relative value function as an approximation of the
relative value function. Second, we propose regenerative simulation to estimate
the discounted relative value function. Finally, we incorporate the
approximating martingale-process method into the regenerative estimator.
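To make the first two techniques concrete, the following minimal sketch (ours,
not the paper's implementation) estimates a discounted relative value function
for a uniformized M/M/1 queue by regenerative simulation: each trajectory is
truncated when the queue empties (the regeneration state, where the value is
normalized to zero). The rates, the cost c(x) = x, the discount factor, and
all function names are illustrative assumptions, and the approximating
martingale-process correction is omitted; the paper applies these ideas to
multiclass networks inside a PPO training loop.

    # Minimal illustrative sketch, not the paper's code.
    import random

    LAM, MU = 0.7, 1.0       # assumed arrival and service rates (load 0.7)
    ALPHA = 0.998            # discount factor used purely for variance reduction
    ETA = LAM / (MU - LAM)   # long-run average queue length of M/M/1

    def step(x):
        """One uniformized transition of the M/M/1 queue length."""
        if random.random() < LAM / (LAM + MU):
            return x + 1     # arrival
        return max(x - 1, 0) # service completion (dummy transition at 0)

    def discounted_relative_value(x0, n_cycles=2000):
        """Estimate h_alpha(x0) = E[sum_t alpha^t (c(X_t) - eta)], truncating
        each trajectory at the regeneration state 0 (value normalized to 0)."""
        total = 0.0
        for _ in range(n_cycles):
            x, disc, acc = x0, 1.0, 0.0
            while True:
                acc += disc * (x - ETA)  # cost c(x) = x, centered by average cost
                x = step(x)
                disc *= ALPHA
                if x == 0:               # regeneration: queue empties
                    break
            total += acc
        return total / n_cycles

    if __name__ == "__main__":
        for x in (1, 5, 10):
            print(x, discounted_relative_value(x))

Truncating at regeneration keeps each estimate a finite, independent cycle sum,
which is what makes averaging across cycles a consistent, low-variance
estimator of the discounted relative value at each starting state.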
Archived Files and Locations
application/pdf, 1.9 MB: arxiv.org (repository), web.archive.org (webarchive)
arXiv: 2008.01644v5