# Regret-Based Reward Elicitation for Markov Decision Processes

50 %
50 %
Information about Regret-Based Reward Elicitation for Markov Decision Processes
Technology

Published on October 9, 2009

Author: kmregan

Source: slideshare.net

## Description

Talk given for University of Toronto Machine Learning Seminar. Fall 2009.

Regret-Based Reward Elicitation for Markov Decision Processes Kevin Regan University of Toronto Craig Boutilier

Introduction 2 Motivation

Introduction 3 Motivation Markov Decision Processes have proven to be an extremely useful model for decision making in stochastic environments • Model requires dynamics and rewards

Introduction 4 Motivation Markov Decision Processes have proven to be an extremely useful model for decision making in stochastic environments • Model requires dynamics and rewards Specifying dynamics a priori can be difﬁcult • We can learn a model of the world in either an ofﬂine or online (reinforcement learning) setting

Introduction 5 Motivation Markov Decision Processes have proven to be an extremely useful model for decision making in stochastic environments • Model requires dynamics and rewards In some simple cases reward can be thought of as being directly “observed” • For instance: the reward in a robot navigation problem corresponding to the distance travelled

Introduction 6 Motivation Except in some simple cases, the speciﬁcation of reward functions for MDPs is problematic • Rewards can vary user-to-user • Preferences about which states/actions are “good” and “bad” need to be translated into precise numerical reward • Time consuming to specify reward for all states/actions Example domain: assistive technology

Introduction 7 Motivation However, • Near-optimal policies can be found without a fully speciﬁed reward function • We can bound the performance of a policy using regret

Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work

Decision Theory 9 Utility Given A decision maker (DM) A set of possible outcomes Θ A set of lotteries L of the form: l ≡ 〈 p1 , x1 , p2 , x2 ,…, pn , xn 〉 where xi ∈Θ, ∑p i =1 i l ≡ 〈x1 , p, x2 〉 = 〈 p, x1 ,(1 − p), x2 〉 Compound lotteries l 1 = 〈0.75, x, 0.25, 〈0.6, y, 0.4, z〉〉 l2 y l1= x l2 = z

Decision Theory 9 Utility Given A decision maker (DM) A set of possible outcomes Θ A set of lotteries L of the form: l ≡ 〈 p1 , x1 , p2 , x2 ,…, pn , xn 〉 where xi ∈Θ, ∑p i =1 i l ≡ 〈x1 , p, x2 〉 = 〈 p, x1 ,(1 − p), x2 〉 Compound lotteries l 1 = 〈0.75, x, 0.25, 〈0.6, y, 0.4, z〉〉 = 〈0.75, x, 0.15, y, 0.1, z〉 y l2 y z l1= x l2 = z l1= x

Decision Theory 10 Utility Axioms Completeness Transitivity Independence Continuity

Decision Theory 11 Utility Axioms Completeness For x, y ∈Θ Transitivity It is the case that either: Independence x is weakly preferred to y : x ± y Continuity y is weakly preferred to x : x y One is indifferent : x ~ y

Decision Theory 12 Utility Axioms Completeness For any x, y, z ∈Θ Transitivity If x ± y and y ± z Independence Then x ± z Continuity

Decision Theory 13 Utility Axioms Completeness For every l 1 , l 2 , l 3 ∈L and p ∈(0,1) Transitivity If l 1 f l 2 Independence Then 〈l 1 , p, l 3 〉 f 〈l 2 , p, l 3 〉 Continuity

Decision Theory 14 Utility Axioms Completeness For every l 1 , l 2 , l 3 ∈L Transitivity If l 1 f l 2 f l 3 Independence Then for some p ∈(0,1) : Continuity l 2 ~ 〈l 1 , p, l 3 〉

Decision Theory 15 Utility Axioms Completeness There exists a utility function u : Θ → ° Transitivity Such that: Independence u(x) ≥ u(y) ⇔ x ± y Continuity n u(l ) = 〈 p1 , x1 ,…, pn , xn 〉 = ∑ pi u(xi ) i The utility of a lottery is the expected utility of its outcomes

Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work

Preference Elicitation 17 Queries Ranking Please order this subset of outcomes Standard Gamble 〈x1 , x2 ,…, xm 〉 Bound u(x1 ) ≥ u(x2 ) ≥ u(x3 ) ≥ L ≥ u(xm )

Preference Elicitation 18 Queries Ranking Please choose a p for which you Standard Gamble are indifferent between y and the Bound lottery 〈x ï , p, x ⊥ 〉 ï ⊥ y ~ 〈x , p, x 〉 u(y) = p

Preference Elicitation 19 Queries Ranking Please choose a p for which y is at Standard Gamble least as good as the lottery 〈x ï ,b, x ⊥ 〉 Bound y ± 〈x ï ,b, x ⊥ 〉 u(y) ≥ b

Preference Elicitation 20 Preference Elicitation Rather than fully specifying a utility function, we 1. Make decision w.r.t. an imprecisely speciﬁed utility function 2. Perform elicitation until we are satisﬁed with the decision Prob Make Decision Satisﬁed? YES Done Util NO User Select Query

Preference Elicitation 21 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg max max u(x) Minimax Regret x∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 x2 7 7 1 x3 2 2 2

Preference Elicitation 22 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg max max u(x) Minimax Regret x∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 x2 7 7 1 x3 2 2 2

Preference Elicitation 23 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg max min u(x) Minimax Regret x∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 x2 7 7 1 x3 2 2 2

Preference Elicitation 24 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg max min u(x) Minimax Regret x∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 x2 7 7 1 x3 2 2 2

Preference Elicitation 25 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg min max max u(x ') − u(x) Minimax Regret x∈Θ x '∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 x2 7 7 1 x3 2 2 2

Preference Elicitation 26 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg min max max u(x ') − u(x) Minimax Regret x∈Θ x '∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 x2 7 7 1 x3 2 2 2

Preference Elicitation 27 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg min max max u(x ') − u(x) Minimax Regret x∈Θ x '∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 5 x2 7 7 1 x3 2 2 2

Preference Elicitation 28 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg min max max u(x ') − u(x) Minimax Regret x∈Θ x '∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 5 x2 7 7 1 x3 2 2 2

Preference Elicitation 29 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg min max max u(x ') − u(x) Minimax Regret x∈Θ x '∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 5 x2 7 7 1 1 x3 2 2 2

Preference Elicitation 30 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg min max max u(x ') − u(x) Minimax Regret x∈Θ x '∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 5 x2 7 7 1 1 x3 2 2 2

Preference Elicitation 31 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg min max max u(x ') − u(x) Minimax Regret x∈Θ x '∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 5 x2 7 7 1 1 x3 2 2 2 6

Preference Elicitation 32 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg min max max u(x ') − u(x) Minimax Regret x∈Θ x '∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 5 x2 7 7 1 1 x3 2 2 2 6

Preference Elicitation 33 Bayesian Decision Criteria Expected Utility Assuming we have a prior φ over Value At Risk potential utility functions

Preference Elicitation 34 Bayesian Decision Criteria Expected Utility Assuming we have a prior φ over Value At Risk potential utility functions φ arg max E [u(x)] u x∈Θ

Preference Elicitation 35 Bayesian Decision Criteria Expected Utility Assuming we have a prior φ over Value At Risk potential utility functions φ ( ) arg max max Pr Eu [u(x)] ≥ δ ≥ η x∈Θ δ 90% η = 90% 10% δ

Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work

Markov Decision Processes 37 Markov Decision Process (MDP) S - Set of States at at+1 A - Set of Actions st st+1 st+2 … Pr(s ' | a, s) - Transitions rt rt+1 rt+2 α - Starting State Distribution γ - Discount Factor WORLD r(s) - Reward [or r(s, a) ] States Actions AGENT

Markov Decision Processes 37 Markov Decision Process (MDP) S - Set of States at at+1 A - Set of Actions st st+1 st+2 … Known Pr(s ' | a, s) - Transitions rt rt+1 rt+2 α - Starting State Distribution γ - Discount Factor WORLD ? r(s) - Reward [or r(s, a) ] States Actions AGENT

Markov Decision Processes 38 MDP - Policies Policy A stationary policy π maps each state to an action For inﬁnite horizon MDPs, every policy is a stationary policy Policy Given a policy π , the value of a state is Value π  ∞ t  V (s0 ) = E  ∑ γ r π , s0   t=0 

Markov Decision Processes 39 MDP - Computing Value Function The value of a policy can be found by successive approximation V0π (s) = r(s, aπ ) V1π (s) = r(s, aπ ) +γ ∑ Pr( s′ | s, aπ )V0π (s ') s' M M M Vkπ (s) = r(s, aπ ) +γ ∑ Pr( s′ | s, aπ )V π (s ') k−1 s' There will exist a ﬁxed point π π V (s) = r(s, aπ ) +γ ∑ Pr(s ' | s, aπ )V ( s′ ) s'

Markov Decision Processes 40 MDP - Optimal Value Functions Optimal We wish to ﬁnd the optimal policy π* Policy * π′ π : V ≥V ∀π ' π* π* Bellman V (s) = max r(s, aπ * ) +γ ∑ Pr( s′ | s, aπ * )V (s ') a s' Equation

Markov Decision Processes 41 Value Iteration Algorithm Yields an Ú− optimal policy 1. initialize V0 , set n = 0, choose Ú> 0 2. For each s : Vn+1 (s) = max r(s, a) +γ ∑ Pr( s′ | s, a)Vn (s ') a s' (1 − γ ) 3. If Vn+1 − Vn > Ú : 2γ increment n and return to step 2 We can recover the policy by ﬁnding the best one step action π (s) = arg max r(s, a) +γ ∑ Pr( s′ | s, a)V (s ') a s'

Markov Decision Processes 42 Linear Programming Formulation minimize V ∑ α (s)V (s) s subject to V (s) ≥ r(s, a) + γ ∑ Pr(s ' | s, a)V (s ') ∀a, s s'

Markov Decision Processes 43 MDP - Occupancy Frequencies f (s, a) An occupancy frequency f (s, a) expresses the total discounted probability of being in state s and taking action a Valid ∑ f (s , a) = ∑ ∑ Pr(s 0 0 | s, a) f (s, a) − α (s0 ) ∀s0 a s a f (s, a)

Markov Decision Processes 44 LP - Occupancy Frequency min. V ∑ α (s)V (s) s subj: V (s) ≥ r(s, a) + γ ∑ Pr(s ' | s, a)V (s ') ∀a, s s' max. f ∑ ∑ f (s, a)r(s, a) s a subj: ∑ f (s , a) − γ ∑ ∑ Pr(s 0 0 | s, a) f (s, a) = α (s0 ) ∀s0 a s a

Markov Decision Processes 44 LP - Occupancy Frequency ∑ ∑ f (s, a)r(s, a) = ∑ α (s)V (s) s a s min. V ∑ α (s)V (s) s subj: V (s) ≥ r(s, a) + γ ∑ Pr(s ' | s, a)V (s ') ∀a, s s' max. f ∑ ∑ f (s, a)r(s, a) s a subj: ∑ f (s , a) − γ ∑ ∑ Pr(s 0 0 | s, a) f (s, a) = α (s0 ) ∀s0 a s a

Markov Decision Processes 45 MDP Summary Slide Policies Over the past couple of decades, there has Dynamics been lot of work done on scaling MDPs Rewards Factored Models Decomposition Linear Approximation

Markov Decision Processes 46 MDP Summary Slide Policies To use these algorithms we need a model of Dynamics the dynamics (transition function). There are techniques for: Rewards Deriving models of dynamics from data. Finding policies that are robust to inaccurate transition models

Markov Decision Processes 47 MDP Summary Slide Policies There has been comparatively little work on Dynamics specifying rewards Rewards Finding policies that are robust to imprecise reward models Eliciting reward information from users

Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work

Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work A. Imprecise Reward Speciﬁcation B. Computing Robust Policies C. Preference Elicitation D. Evaluation E. Future Work

Text 50 Current Work MDP Compute Satisﬁed? YES Done Robust Policy R NO User Select Query

Model : MDP 51 MDP - Reward Uncertainty We quantify the strict uncertainty over reward with a set of feasible reward functions R We specify R using a set of linear inequalities forming a polytope Where do these inequalities come from? Bound queries: Is r(s,a) > b? Policy comparisons: Is fπ ·r > fπ′ ·r ?

Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work A. Imprecise Reward Speciﬁcation B. Computing Robust Policies C. Preference Elicitation D. Evaluation E. Future Work

Computation 53 Minimax Regret Original min max max g ·r − f ·r Formulation f∈F g∈F r∈R Benders’ minimize δ f∈F , δ Decomposition subject to : δ ≥ g ·r − f ·r ∀ g ∈F r ∈R

Computation 54 Minimax Regret Original min max max g ·r − f ·r Formulation f∈F g∈F r∈R Benders’ minimize δ f∈F , δ Decomposition subject to : δ ≥ g ·r − f ·r ∀ g ∈V ( F ) r ∈V ( R ) Maximums will exist at the vertices of F and R Rather than enumerating an exponential number of vertices we use constraint generation

Computation 55 Minimax Regret - Constraint Generation 1. We limit adversary • Player minimizes regret w.r.t. a small set of adversary responses 2. We untie adversary’s hands • Adversary ﬁnds maximum regret w.r.t. player’s policy • Add adversary’s choice of r and g to set of adversary responses Done when: untying adversary’s hands yields no improvement • ie. regret of player minimizing = regret of adversary maximizing

Computation 56 Constraint Generation - Player 1. Limit adversary minimize δ f∈F , δ subject to : δ ≥ g ·r − f ·r ∀ 〈 g, r 〉 ∈GEN

Computation 57 Constraint Generation - Adversary 2. Untie adversary’s hands: Given player policy f max max g ·r − f ·r g∈F r∈R This formulation is a non-convex linear program We reformulate as a mixed integer linear program

(indeed, it is the maximally violated such constraint). So it Computation 58 is added to Gen and the process repeats. Constraint Generation ,-R) is realized by the following MIP, Computation of MR(f Adversary using value and Q-functions:1 2. maximize α · V − r · f (9) Q,V,I,r subject to: Qa = ra + γPa V ∀a∈A V ≥ Qa ∀a∈A (10) V ≤ (1 − Ia )Ma + Qa ∀a∈A (11) Cr ≤ d X Ia = 1 (12) a Ia (s) ∈ {0, 1} ∀a, s (13) ⊥ Ma = M − Ma Only tractablerepresents the adversary’s policy, with Ia (s) de- Here I for small Markov Decision Problems noting the probability of action a being taken at state s

) ! " # \$ % & ' ( )* )) Computation 59 +,-./012314565/7 Figure 2: Scaling of constraint generation with number of states. Approximating Minimax Regret 9.:54;<.0=//1/0>7/7470?5@09.A/.40<670*+,-./01203454.6 )78) We )7(# the Max Regret MIP formulation relax 9.:54;<.0=//1/ )7() The )7)# value of the resulting policy is no longer exact, however, resulting reward still feasible. We ﬁnd optimal policy w.r.t. to )7)) resulting reward # ! " \$ % & ' () *+,-./01203454.6 9.:54;<.0=//1/0>7/7470?;B;,5@09.A/.40<670*+,-./01203454.6 )78) 9.:54;<.0=//1/ )7(# )7() )7)# )7)) ! " # \$ % & ' () *+,-./01203454.6 Figure 3: Relative approximation error of linear relaxation

Computation 60 Scaling (Log Scale) 89-/1<7=1+,-./012314565/7 )***** >?6@51A9B9-6?1C/D0/5 EFF02?9-65/1A9B9-6?1C/D0/5 )**** )*** 89-/1:-7; )** )* ) ! " # \$ % & ' ( )* )) +,-./012314565/7 Figure 2: Scaling of constraint generation with number of states.

Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work A. Imprecise Reward Speciﬁcation B. Computing Robust Policies C. Preference Elicitation D. Evaluation E. Future Work

Reward Elicitation 62 Reward Elicitation MDP Compute Satisﬁed? YES Done Robust Policy R NO User Select Query

Reward Elicitation 63 Bound Queries Query Is r(s,a) > b? where b is a point between the upper and lower bounds of r(s,a) Gap Δ(s, a) = max r '(s, a) − min r(s, a) r' r At each step of elicitation we need to select the s, a parameters and b using the gap:

Reward Elicitation 64 Selecting Bound Queries Halve the Largest Gap (HLG) Current Solution (CS) Select the s,a with the Use the current solution g(s,a) largest gap Δ(s,a) [or f(s,a)] of the minimax regret calculation to weight Set b to the midpoint of the each gap Δ(s,a) interval for r(s,a) Select the s,a with the largest weighted gap g(s,a)Δ(s, a) Set b to the midpoint of the interval for r(s,a)

Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work A. Imprecise Reward Speciﬁcation B. Computing Robust Policies C. Preference Elicitation D. Evaluation E. Future Work

Evaluation 66 Experimental Setup Randomly generated MDPs Semi-sparse random transition function, discount factor of 0.95 Random true reward drawn from ﬁxed interval, upper and lower bounds on reward drawn randomly All results are averaged over 20 runs 10 states 5 actions

Evaluation 67 Elicitation Effectiveness We examine the combination of each criteria for robust policies with each of the elicitation strategies Minimax Regret Halve the Largest Gap ƒ (MMR) (HLG) Maximin Regret Current Solution (MR) (CS)

Evaluation 68 Max Regret - Random MDP /01+2(3)(4+567+,'-.()+89+&'():(6 %" 0.35 #\$ 0.12 /01:-:;+<+=>? /:;:-01+<+=>? 0.30 %! /01:-:;+<+@A #! 0.10 /:;:-01+<+@A \$" 0.25 1 0.08 /01+2(3)(4 2)'(+3(4)(5 Max Regret True Regret \$! 0.20 0 0.06 #" 0.15 0.04 / #! 0.10 56+>+?@A 34+>+?@A \$ 0.02 " 0.05 56+>+BC 34+>+BC ! ! "! %!! !1 "! #!! #"! \$!! \$"! %!! ! &'()*+,'-.()

Evaluation 69 True Regret (Loss) - Random MDP 2)'(+3(4)(5+678+,'-.()+9:+&'();(7 #\$ 0.12 -:;+<+=>? <=>;-;?+@+ABC 01+<+=>? <;?;-=>+@+ABC -:;+<+@A #! 0.10 <=>;-;?+@+DE 01+<+@A <;?;-=>+@+DE 1 0.08 2)'(+3(4)(5 True Regret 0 0.06 0.04 / \$ 0.02 ! "! %!! 1 ! "! #!! #"! \$!! \$"! %!! &'()*+,'-.()

Evaluation 70 Queries per Reward Point - Random MDP <45;1=/7,0>0?+./4.509./0/.67/80914:; \$&! 700 \$!! 600 Most of reward 500 #&! space unexplored *+,-./0120/.67/80914:;5 #!! 400 "&! 300 We repeatedly query a small "!! 200 set of “high impact” reward points &! 100 ! ! " # \$ % & ' ( ) *+,-./01203+./4.5

Evaluation 71 Autonomic Computing Setup Host 1 Demand Resource 2 Hosts Total 3 Demand levels Resource 3 Units of Resource M Model Host k Demand Resource 90 States 10 Actions

Evaluation 72 Max Regret - Autonomic Computing Queries vs. Max Regret 0.7 0.12 Maximin Minimax Regret 0.6 0.10 0.5 0.08 True Regret Max Regret 0.4 0.06 0.3 0.04 0.2 0.02 0.1 egret 0.0 0.00 1000 1 0 200 400 600 800 1000 0 Queries

Evaluation 73 True Regret (Loss) - Autonomic Computing Queries vs. True Regret 0.12 Maximin egret Minimax Regret 0.10 0.08 True Regret 0.06 0.04 0.02 0.00 1000 0 1 200 400 600 800 1000 Queries

Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work A. Imprecise Reward Speciﬁcation B. Computing Robust Policies C. Preference Elicitation D. Evaluation E. Future Work

Introduction 75 Overview MDP Compute Satisﬁed? YES Done Robust Policy R NO User Select Query

Introduction 75 Contributions Compute 1. A technique for ﬁnding robust policies using Satisﬁed? YES Done Robust Policy minimax regret NO 2. A simple elicitation procedure that quickly leads to Select Query near-optimal/optimal policies

Conclusion 76 Future Work Bottleneck: Adversary’s max regret computation Scaling Idea: The set Γ of adversary policies g that will ever be a regret maximizing response can be small Factored MDPs Approaches that uses Γ to Richer efﬁciently compute max regret Queries We have An algorithm to ﬁnd Γ A theorem that shows the algorithm runs in time polynomial in the number of policies found

Conclusion 77 Future Work Scaling Working with Factored MDPs will Factored Model problems in a more natural way MDPs Richer Allow us to use lower the dimensionality of Queries the reward functions Leverage existing techniques for scaling MDPs that take advantage of factored

Conclusion 78 Future Work In state s which action would you like to take? Scaling Factored MDPs In state s do you prefer action a1 to a2 ? Richer Queries Do you prefer sequence s1 , a1 , s2 , a2 ,…sk to ′ ′ ′ ′ ′ s , a , s , a ,…s ? 1 1 2 2 k

Conclusion 79 Future Work Do you prefer tradeoff Scaling f (s2 , a3 ) = f1 amount of time doing (s2 , a3 ) and f (s1 , a4 ) = f2 amount of time doing (s1 , a4 ) Factored or MDPs f ′ (s2 , a3 ) = f 1 amount of time doing (s2 , a3 ) and ′ Richer f ′ (s1 , a4 ) = f ′2 amount of time doing (s1 , a4 ) ? Queries f1 s Cab Available f1 s No Street Car f2 f2 a Take Cab a Waiting s No Street Car s Cab Available f2 f1 f1 f2 a Waiting a Take Cab

Thank you.

Regret-Based Reward Elicitation for Markov Decision Processes Kevin M Regan University of Toronto Craig Boutilier

f g r ax min r·f (7) subject to: γE f + α = 0 Appendix 82 F r∈R γE g + α = 0 Full Formulation on the adversary. If MR(f , R) = MMR (R) then the con- to com uncertainty in any MDP pa- Cr ≤ at straint for g, r is satisﬁed d the current solution, and in- mine k has focused on uncertainty deed all unexpressed constraints must be satisﬁed as well. have t This is equivalent to a minimization: the of eliciting information The process then terminates with minimax optimal solu- freque rewards is left unaddressed. tion minimize δ MR(f , R) > MMR (R), implying that f . Otherwise, (8) exact f ,δ the constraint for g, r is violated in the current relaxation Master uted for uncertain transition (indeed, it is the r · g − r · f violated suchF, r ∈ R So it subject to: maximally ≤ δ ∀ g ∈ constraint). We ha an alt riterion by decomposing the is added to Gen and the+ α = 0 repeats. γE f process sarial nd using dynamic program- ization to ﬁnd the worst case Computation of MR(f , R) is realized by the following MIP, (for a ]. McMahan, Gordon, and This corresponds Q-functions:1 dual LP formulation of using value and to the standard for m rogramming approach to ef- an MDP with the addition of adversarial policy constraints. imatio maximize α · V − r · f (9) n value of an MDP (we em- The inﬁnite number of constraints can be reduced: ﬁrst we Q,V,I,r tice): need only retain as potentially active those ∀ a ∈ A subject to: Qa = ra + γPa V constraints for the in ch to ours below). Delage vertices of polytope R; Qa for any r ∈ R, weaonly require V ≥ and ∀ ∈A (10) tors. oblem of uncertainty over re- V ≤ (1 − a )M + Qa ∀a∈A ∗ the constraint correspondingIto itsaoptimal policy gr . How- (11) does n functions) in the presence of policy rcentile criterion, which can ever, vertex enumeration is not feasible; so we apply Ben- Cr ≤ d Subproblem than maximin. They also ders’ decomposition [2]a to iteratively generate constraints. X I =1 (12) constr remai ng rewards using sampling to At each iteration, two optimizations are solved. The master a value e of information of noisy in- Ia (s) ∈ {0, of ∀a, s problem solves a relaxation 1} program (8) using only a (13) that is ard space. The percentile ap- small subset of the constraints,M⊥ Ma = M − corresponding to a subset a this s n nor does it offer a bound on Gen of all g, r pairs; we call these generated constraints. lution es ([20]) also adopt maximin Initially, this set is arbitrary (e.g., empty). with Ia (s) de- Here I represents the adversary’s policy, Intuitively, in lem s noting the probability of action a being taken at state s

Evaluation 83 Maximin Value - Random MDP 2345-56+738'(+9:;+,'-.()+<=+&'()5(: #!! 1.00 %" 0.35 0.30 %! 0.95 1" \$" 0.25 0.90 1! 2345-56+738'( Maximin Value /01+2(3)(4 Max Regret \$! 0.20 0.85 0" #" 0.15 0! 0.80 #! 0.10 2345-56+>+?@A 0.75 /" 2565-34+>+?@A " 0.05 2345-56+>+BC 2565-34+>+BC /! 0.70 1 ! ! "! #!! #"! \$!! \$"! %!! ! &'()*+,'-.()

Computation 84 Regret Gap vs Time 3.4567/859/:1;/+,-./!/3.<= "&# "\$# "## *# 3.4567/859 Regret Gap (# &# \$# # !\$# !"### # "### \$### %### &### '### (### )### *### +,-./0-12

 User name: Comment:

## Related presentations

#### Neuquén y el Gobierno Abierto

October 30, 2014

Presentación que realice en el Evento Nacional de Gobierno Abierto, realizado los ...

#### Decision CAMP 2014 - Erik Marutian - Using rules-b...

October 16, 2014

In this presentation we will describe our experience developing with a highly dyna...

#### Schema.org: What It Means For You and Your Library

November 7, 2014

Presentation to the LITA Forum 7th November 2014 Albuquerque, NM

#### WearableTech: Una transformación social de los p...

November 3, 2014

Un recorrido por los cambios que nos generará el wearabletech en el futuro

#### O Impacto de Wearable Computers na vida das pessoa...

November 5, 2014

Um paralelo entre as novidades & mercado em Wearable Computing e Tecnologias Assis...

#### All you need to know about the Microsoft Band

November 6, 2014

Microsoft finally joins the smartwatch and fitness tracker game by introducing the...

## Related pages

### Regret-based Reward Elicitation for Markov Decision Processes

Regret-based Reward Elicitation for Markov Decision Processes Kevin Regan Department of Computer Science University of Toronto Toronto,ON, CANADA

### Regret-based Reward Elicitation for Markov Decision Processes

Regret-based Reward Elicitation for Markov Decision Processes Kevin Regan and Craig Boutilier Department of Computer Science University of Toronto

### Regret-based Reward Elicitation for Markov Decision Processes

The specification of aMarkov decision process (MDP) can be difficult. Reward function specification is especially problematic; in practice, it is often

### Regret-based Reward Elicitation for Markov Decision Processes

BibTeX @MISC{Regan_regret-basedreward, author = {Kevin Regan and Craig Boutilier}, title = {Regret-based Reward Elicitation for Markov Decision Processes ...

### Title: Regret-based Reward Elicitation for Markov Decision ...

Abstract: The specification of aMarkov decision process (MDP) can be difficult. Reward function specification is especially problematic; in ...

### Regret-based reward elicitation for Markov decision ...

Abstract The specification of a Markov decision process (MDP) can be difficult. Reward function specification is especially problematic; in practice, it is ...

### Eliciting Additive Reward Functions for Markov Decision ...

Eliciting Additive Reward Functions for Markov Decision Processes ... We extend “ﬂat” regret-based reward elicitation for MDPs [9] ...