 ## Worst-case Analysis of Strategy Iteration and the Simplex Method

Publikation: Bog/antologi/afhandling/rapportPh.d.-afhandling

### Dokumenter

• Thomas Dueholm Hansen, Danmark
In this dissertation we study strategy iteration (also known as policy iter-
ation) algorithms for solving Markov decision processes (MDPs) and two-
player turn-based stochastic games (2TBSGs). MDPs provide a mathematical
model for sequential decision making under uncertainty. They are widely
used to model stochastic optimization problems in various areas ranging
from operations research, machine learning, articial intelligence, economics
and game theory. The class of two-player turn-based stochastic games is
a natural generalization of Markov decision processes that is obtained by
introducing an adversary. 2TBSGs form an intriguing class of games whose
status in many ways resembles that of linear programming 40 years ago.
They can be solved eciently with strategy iteration algorithms, resembling
the simplex method for linear programming, but no polynomial time
algorithm is known. Linear programming is an exceedingly important problem
with numerous applications. The simplex method was introduced by
Dantzig in 1947, and has since then been studied extensively. It can be
shown that MDPs can be formulated as linear programs, thus, giving rise
to the connection.
Strategy iteration and simplex type algorithms are local search algorithms
that repeatedly improve a current candidate solution. We say that
the strategy iteration algorithm repeatedly performs improving switches, and
that the simplex method repeatedly performs improving pivots. The strategy
iteration algorithm and the simplex method are, in fact, paradigms for
solving 2TBSGs, MDPs, and linear programs. To obtain concrete algorithms
we must specify an improvement rule (pivoting rule). In this dissertation
we mainly focus on three improvement rules:
Howard's improvement rule: Update the current solution by simultaneously
performing all improving switches.
RandomEdge: Update the current solution by performing a single
uniformly random improving switch.
RandomFacet: A recursively dened, randomized improvement rule
that was introduced independently by Kalai and by Matousek, Sharir
and Welzl. The improvement rule works by recursively solving certain
sub-problems that have been chosen randomly.
One of the main results of the dissertation is to show that there exist
families of MDPs for which using the RandomEdge and RandomFacet
improvement rules lead to the expected number of improvement steps being
almost exponential in the size of the problem (the bounds have subexponential
form). Utilizing a tight connection between MDPs and linear programming,
it is shown that the same bounds apply to the corresponding pivoting
rules for the simplex method for solving linear programs. Prior to this result
no super-polynomial lower bounds were known for randomized pivoting
rules for the simplex method. On the other hand, essentially every known
deterministic pivoting rule had been shown to be exponential. The results
for RandomEdge and RandomFacet are, therefore, the rst of their kind
for randomized pivoting rules, which solves a more than forty year old open
problem.
Another important result of the dissertation is to show that Howard's
improvement rule for the strategy iteration algorithm (Howard's algorithm
in short) solves 2TBSGs with a xed discount in a number of iterations that
is polynomial in the (combinatorial) size of the corresponding games. This
proves, for the rst time, that 2TBSGs with a xed discount can be solved in
strongly polynomial time. Howard's algorithm for MDPs dates back to 1960
and is the algorithm of choice in practise for solving MDPs and 2TBSGs. It
is therefore appealing that the result is obtained for this particular classical
algorithm.
It is known that Howard's algorithm in general, when the discount is
not xed, may be exponential. This result relies heavily on the element of
uncertainty, however. When the MDPs are deterministic, the bound does
not apply. It is shown in the dissertation that for deterministic MDPs the
number of iterations performed by Howard's algorithm may be quadratic
in the size of the problem. Prior to this result the best lower bounds were
linear in the size of the problem.
Originalsprog Engelsk
Forlag Aarhus Universitet, Datalogisk Institut 186 Udgivet - 2012

### Note vedr. afhandling

Supervisor: Peter Bro Miltersen

Citationsformater