Publication: Book/anthology/dissertation/report › PhD dissertation

- PhD dissertation, Thomas Dueholm Hansen
Submitted manuscript, 1.24 MB, PDF document

- Thomas Dueholm Hansen, Denmark

In this dissertation we study strategy iteration (also known as policy iteration) algorithms for solving Markov decision processes (MDPs) and two-player turn-based stochastic games (2TBSGs). MDPs provide a mathematical model for sequential decision making under uncertainty. They are widely used to model stochastic optimization problems in areas ranging from operations research and machine learning to artificial intelligence, economics, and game theory. The class of two-player turn-based stochastic games is a natural generalization of Markov decision processes obtained by introducing an adversary. 2TBSGs form an intriguing class of games whose status in many ways resembles that of linear programming 40 years ago. They can be solved efficiently with strategy iteration algorithms, which resemble the simplex method for linear programming, but no polynomial-time algorithm is known for them. Linear programming is an exceedingly important problem with numerous applications. The simplex method was introduced by Dantzig in 1947 and has since been studied extensively. It can be shown that MDPs can be formulated as linear programs, which gives rise to the connection.
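The connection can be sketched concretely. For a discounted MDP with discount factor \(\gamma < 1\), rewards \(r(s,a)\), and transition probabilities \(P(s' \mid s,a)\), the optimal values solve a standard linear program (a textbook formulation, not spelled out in this abstract):

```latex
\min_{v}\; \sum_{s} v_s
\quad\text{subject to}\quad
v_s \;\ge\; r(s,a) \;+\; \gamma \sum_{s'} P(s' \mid s,a)\, v_{s'}
\quad\text{for all states } s \text{ and actions } a.
```

At the optimum, each state's constraint is tight for an optimal action, and improving switches in the strategy iteration algorithm correspond to improving pivots in this LP view.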

Strategy iteration and simplex-type algorithms are local search algorithms that repeatedly improve a current candidate solution. We say that the strategy iteration algorithm repeatedly performs improving switches, and that the simplex method repeatedly performs improving pivots. The strategy iteration algorithm and the simplex method are, in fact, paradigms for solving 2TBSGs, MDPs, and linear programs. To obtain concrete algorithms we must specify an improvement rule (pivoting rule). In this dissertation we mainly focus on three improvement rules:

- Howard's improvement rule: Update the current solution by simultaneously performing all improving switches.

- RandomEdge: Update the current solution by performing a single uniformly random improving switch.

- RandomFacet: A recursively defined, randomized improvement rule that was introduced independently by Kalai and by Matousek, Sharir and Welzl. The improvement rule works by recursively solving certain sub-problems that have been chosen randomly.
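To make the first two rules concrete, here is a minimal Python sketch of strategy (policy) iteration on a tiny invented discounted MDP. The MDP itself, the state/action layout, and the tolerances are illustrative assumptions, not taken from the dissertation, and the recursive RandomFacet rule is omitted for brevity.

```python
import random

# Toy discounted MDP, invented for illustration: two states, two actions each.
# R[s][a] is the immediate reward of playing action a in state s, and
# P[s][a][t] the probability of then moving to state t.
GAMMA = 0.9
R = [[0.0, 1.0],
     [2.0, 0.0]]
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.5, 0.5], [1.0, 0.0]]]
N_STATES, N_ACTIONS = 2, 2

def q_value(s, a, v):
    """One-step lookahead value of playing a in s and then collecting v."""
    return R[s][a] + GAMMA * sum(P[s][a][t] * v[t] for t in range(N_STATES))

def evaluate(policy, sweeps=2000):
    """Value of a stationary policy, approximated by iterating its Bellman
    operator (the error shrinks by a factor of GAMMA per sweep)."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        v = [q_value(s, policy[s], v) for s in range(N_STATES)]
    return v

def strategy_iteration(rule, seed=0, eps=1e-9):
    """Strategy iteration with one of two improvement rules: 'howard'
    performs all improving switches simultaneously (taking the best action
    in every improvable state); 'random_edge' performs a single uniformly
    random improving switch."""
    rng = random.Random(seed)
    policy = [0] * N_STATES
    while True:
        v = evaluate(policy)
        # Improving switches: (state, action) pairs that strictly improve v.
        switches = [(s, a) for s in range(N_STATES) for a in range(N_ACTIONS)
                    if q_value(s, a, v) > v[s] + eps]
        if not switches:
            return policy, v  # no improving switch left: policy is optimal
        if rule == "howard":
            for s in {s for s, _ in switches}:
                policy[s] = max(range(N_ACTIONS),
                                key=lambda a: q_value(s, a, v))
        else:  # RandomEdge
            s, a = rng.choice(switches)
            policy[s] = a
```

On this two-state example both rules reach the same optimal policy in a couple of iterations; the point of the lower bounds below is that on carefully constructed families of MDPs their iteration counts diverge dramatically.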

One of the main results of the dissertation is to show that there exist families of MDPs for which the RandomEdge and RandomFacet improvement rules lead to an expected number of improvement steps that is almost exponential in the size of the problem (the bounds have subexponential form). Utilizing a tight connection between MDPs and linear programming, it is shown that the same bounds apply to the corresponding pivoting rules for the simplex method for solving linear programs. Prior to this result, no super-polynomial lower bounds were known for randomized pivoting rules for the simplex method. On the other hand, essentially every known deterministic pivoting rule had been shown to be exponential. The results for RandomEdge and RandomFacet are, therefore, the first of their kind for randomized pivoting rules, solving a more than forty-year-old open problem.

Another important result of the dissertation is to show that Howard's improvement rule for the strategy iteration algorithm (Howard's algorithm for short) solves 2TBSGs with a fixed discount in a number of iterations that is polynomial in the (combinatorial) size of the corresponding games. This proves, for the first time, that 2TBSGs with a fixed discount can be solved in strongly polynomial time. Howard's algorithm for MDPs dates back to 1960 and is the algorithm of choice in practice for solving MDPs and 2TBSGs. It is therefore appealing that the result is obtained for this particular classical algorithm.

It is known that Howard's algorithm in general, when the discount is not fixed, may require an exponential number of iterations. This result relies heavily on the element of uncertainty, however: when the MDPs are deterministic, the bound does not apply. It is shown in the dissertation that for deterministic MDPs the number of iterations performed by Howard's algorithm may be quadratic in the size of the problem. Prior to this result the best lower bounds were linear in the size of the problem.


Original language | English
---|---
Publisher | Aarhus Universitet, Datalogisk Institut
Number of pages | 186
Status | Published - 2012

Supervisor: Peter Bro Miltersen


ID: 47728517