Alborz

Information about Alborz

Published on September 25, 2007

Author: knowdiff

Source: slideshare.net

Description

Knowledge Diffusion Network
Visiting Lecturer Program (114)
Speaker: Alborz Geramifard
Ph.D. Candidate
Department of Computing Science, University of Alberta, Edmonton, Canada
Title: Incremental Least-Squares Temporal Difference Learning

Time: Tuesday, Sep 11, 2007, 12:00-1:00 pm
Location: Department of Computer Engineering, Sharif University of Technology, Tehran

Incremental Least-Squares Temporal Difference Learning
Alborz Geramifard
September 2007
alborz@cs.ualberta.ca

Acknowledgements: Richard Sutton, Michael Bowling, Martin Zinkevich

The Kingdom of Agencia

The king of Agencia is generous, wise, and kind, but forgetful. He rules on agriculture, politics, research, ... and on relations with the neighboring land of Envirocia.

The Ambassador of Envirocia, to the Vazir of Agencia: "Your king is generous, wise, and kind, but he is forgetful. Sometimes he makes decisions which are not wise in light of our earlier conversations."

The Vazir of Agencia: "Having consultants for various subjects might be the answer ..."

Consultants

With consultants in place, the Ambassador tells the Vazir: "I should admit that recently the king makes decent judgments, but as the number of our meetings increases, it takes a while for me to hear back from his majesty ..."

Reinforcement Learning Framework

[Figure: the agent receives an observation and a reward from the environment and selects an action; a goal state is marked. Arrows note: more complicated state spaces, more complicated tasks.]
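The loop in the figure can be written down directly. Below is a minimal sketch in Python; the env and agent objects and their method names (reset, step, act, learn) are hypothetical stand-ins for illustration, not interfaces from the talk:

    # Minimal agent-environment interaction loop (hypothetical interfaces).
    def run_episode(env, agent):
        observation = env.reset()                         # initial observation
        total_reward = 0.0
        done = False
        while not done:
            action = agent.act(observation)               # agent selects an action
            observation, reward, done = env.step(action)  # environment responds
            agent.learn(observation, reward)              # agent updates from the reward
            total_reward += reward
        return total_reward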

Summary

The trade-off: data efficiency versus speed. Part I, the king alone, is TD: fast but data inefficient. Part II, the king with consultants, is LSTD: data efficient but slow. The open question is the point in between, which this talk fills with iLSTD.

Contributions

- iLSTD: a new policy evaluation algorithm
- Extension with eligibility traces
- Running-time analysis
- Dimension selection methods
- Proof of convergence
- Empirical results

Outline

Motivation, Introduction, The New Approach, Dimension Selection, Conclusion

Markov Decision Process

$\langle S, A, P^a_{ss'}, R^a_{ss'}, \gamma \rangle$

[Figure: an example MDP with states B.S., M.S., and Ph.D., actions Working and Studying, and stochastic transitions labeled with probabilities and rewards (e.g. 100%,+60; 70%,-200; 80%,-50; 10%,-300).]

Sample trajectory: B.S., Working, +60, B.S., Studying, -50, M.S., ...
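The MDP tuple can be encoded as plain data. A sketch of a transition model and a sampler in Python; the dictionary layout is my own, and the probabilities and rewards are illustrative, since the slide's diagram is not fully recoverable from this transcript:

    import random

    gamma = 0.9  # discount factor (assumed value; not stated on the slide)

    # P[s][a] = list of (probability, reward, next_state); numbers illustrative.
    P = {
        "B.S.": {
            "Working":  [(1.0, +60.0, "B.S.")],
            "Studying": [(0.7, -200.0, "M.S."), (0.3, -200.0, "B.S.")],
        },
        "M.S.": {
            "Working":  [(1.0, +85.0, "M.S.")],
            "Studying": [(0.8, -50.0, "Ph.D."), (0.2, -50.0, "M.S.")],
        },
    }

    def sample_transition(P, state, action):
        """Sample (reward, next_state) according to the transition probabilities."""
        u, cum = random.random(), 0.0
        for prob, reward, nxt in P[state][action]:
            cum += prob
            if u <= cum:
                return reward, nxt
        return reward, nxt  # guard against floating-point round-off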

Policy Evaluation ⇄ Policy Improvement

[Figure: the policy iteration loop, alternating policy evaluation and policy improvement. This talk focuses on policy evaluation.]

Notation

Scalars, regular font: $r_{t+1}$, $V^\pi(s)$
Vectors, bold lower case: $\boldsymbol{\phi}(s)$, $\boldsymbol{\mu}(\theta)$
Matrices, bold upper case: $\mathbf{A}_t$, $\tilde{\mathbf{A}}$

Policy Evaluation

$V^\pi(s) = E_\pi\!\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_0 = s\right]$

Linear Function Approximation

$V(s) = \boldsymbol{\theta} \cdot \boldsymbol{\phi}(s) = \sum_{i=1}^{n} \theta_i \phi_i(s)$

Sparsity of Features

Sparsity: only $k$ features are active at any given moment, $k \ll n$.

Acrobot [Sutton 96]: 48 ≪ 18,648
Card game [Bowling et al. 02]: 3 ≪ 10^6
Keepaway soccer [Stone et al. 05]: 416 ≪ 10^4
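Sparsity is what makes the per-step costs below achievable. A sketch in Python of a sparse, binary feature representation, where a state's features are stored as the list of its k active indices (this representation is my assumption, though it matches how tile coding is typically implemented):

    import numpy as np

    n = 10_000               # total number of features
    theta = np.zeros(n)      # weight vector

    def value(theta, active_indices):
        """V(s) = theta . phi(s). With binary features, only the k active
        entries of phi(s) are 1, so the dot product is a k-term sum."""
        return theta[active_indices].sum()

    phi_s = [12, 407, 9001]  # a state with k = 3 active features
    v = value(theta, phi_s)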

Temporal Difference Learning: TD(0) [Sutton 88]

A transition under $\pi$: $s_t \xrightarrow{a,\,r} s_{t+1}$

Tabular representation:
$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$
$V_{t+1}(s_t) = V_t(s_t) + \alpha_t \delta_t$

Linear function approximation:
$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha_t \boldsymbol{\phi}(s_t)\, \delta_t(V)$
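The TD(0) update, in the same sparse representation as above, touches only the k active entries of θ, which is where the O(k) cost on the next slide comes from. A minimal sketch (function and variable names mine):

    import numpy as np

    def td0_update(theta, phi_s, phi_s_next, reward, alpha, gamma):
        """One TD(0) step with binary features given as index lists:
        delta = r + gamma V(s') - V(s);  theta += alpha * phi(s) * delta."""
        delta = reward + gamma * theta[phi_s_next].sum() - theta[phi_s].sum()
        theta[phi_s] += alpha * delta   # phi(s) is 1 on the active indices
        return delta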

TD(0) Properties

Computational complexity: $O(k)$ per time step (constant in $n$).
Data inefficient: uses only the last transition.

Least-Squares TD (LSTD) [Bradtke, Barto 96]

Sum of TD updates:

$\boldsymbol{\mu}_t(\boldsymbol{\theta}) = \sum_{i=1}^{t} \boldsymbol{\phi}_i\, \delta_i(V_{\boldsymbol{\theta}})$
$= \underbrace{\sum_{i=1}^{t} \boldsymbol{\phi}_i r_{i+1}}_{\mathbf{b}_t} - \underbrace{\sum_{i=1}^{t} \boldsymbol{\phi}_i (\boldsymbol{\phi}_i - \gamma \boldsymbol{\phi}_{i+1})^T}_{\mathbf{A}_t}\, \boldsymbol{\theta}$
$= \mathbf{b}_t - \mathbf{A}_t \boldsymbol{\theta}$

Setting $\boldsymbol{\mu}_t(\boldsymbol{\theta}) = 0$ yields $\boldsymbol{\theta} = \mathbf{A}^{-1}\mathbf{b}$.
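LSTD can thus be implemented by accumulating A and b and solving on demand. A dense numpy sketch, omitting the sparse bookkeeping the talk assumes (class layout mine):

    import numpy as np

    class LSTD:
        def __init__(self, n, gamma):
            self.A = np.zeros((n, n))
            self.b = np.zeros(n)
            self.gamma = gamma

        def observe(self, phi, reward, phi_next):
            """Accumulate b_t = sum_i phi_i r_{i+1} and
            A_t = sum_i phi_i (phi_i - gamma phi_{i+1})^T."""
            self.b += reward * phi
            self.A += np.outer(phi, phi - self.gamma * phi_next)

        def solve(self):
            """theta = A^{-1} b; least-squares is used for robustness early on,
            when A may still be singular."""
            return np.linalg.lstsq(self.A, self.b, rcond=None)[0]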

LSTD Properties

Computational complexity: $O(n^2)$ per time step (quadratic).
Data efficient: looks through all data.

Outline: next up, The New Approach.

The New Approach

$\mathbf{A}$ and $\mathbf{b}$ change on each iteration:

$\boldsymbol{\mu}_t(\boldsymbol{\theta}) = \underbrace{\sum_{i=1}^{t} \boldsymbol{\phi}_i r_{i+1}}_{\mathbf{b}_t} - \underbrace{\sum_{i=1}^{t} \boldsymbol{\phi}_i (\boldsymbol{\phi}_i - \gamma \boldsymbol{\phi}_{i+1})^T}_{\mathbf{A}_t}\, \boldsymbol{\theta}$

$\mathbf{b}_t = \mathbf{b}_{t-1} + \underbrace{r_t \boldsymbol{\phi}_t}_{\Delta\mathbf{b}_t}$
$\mathbf{A}_t = \mathbf{A}_{t-1} + \underbrace{\boldsymbol{\phi}_t (\boldsymbol{\phi}_t - \gamma \boldsymbol{\phi}_{t+1})^T}_{\Delta\mathbf{A}_t}$

Incremental LSTD [Geramifard, Bowling, Sutton 06]

$\boldsymbol{\mu}_t(\boldsymbol{\theta}) = \mathbf{b}_t - \mathbf{A}_t \boldsymbol{\theta}$

With $\boldsymbol{\theta}$ fixed: $\boldsymbol{\mu}_t(\boldsymbol{\theta}) = \boldsymbol{\mu}_{t-1}(\boldsymbol{\theta}) + \Delta\mathbf{b}_t - (\Delta\mathbf{A}_t)\boldsymbol{\theta}$
With $\mathbf{A}$ and $\mathbf{b}$ fixed: $\boldsymbol{\mu}_t(\boldsymbol{\theta}_{t+1}) = \boldsymbol{\mu}_t(\boldsymbol{\theta}_t) - \mathbf{A}_t(\Delta\boldsymbol{\theta}_t)$

iLSTD

$\boldsymbol{\mu}_t(\boldsymbol{\theta}_{t+1}) = \boldsymbol{\mu}_t(\boldsymbol{\theta}_t) - \mathbf{A}_t(\Delta\boldsymbol{\theta}_t)$

How to change $\boldsymbol{\theta}$? Descending in the direction of $\boldsymbol{\mu}_t(\boldsymbol{\theta})$ changes all $n$ dimensions of $\boldsymbol{\theta}$ at once, so maintaining $\boldsymbol{\mu}$ costs $O(n^2)$. Changing a single dimension costs only $O(n)$, which motivates updating a few selected dimensions at a time.

iLSTD Algorithm (Algorithm 5)

0   s ← s0, A ← 0, µ ← 0, t ← 0
1   Initialize θ arbitrarily
2   repeat
3       Take action according to π and observe r, s'
4       t ← t + 1
5       ∆b ← φ(s) r
6       ∆A ← φ(s) (φ(s) − γ φ(s'))^T
7       A ← A + ∆A
8       µ ← µ + ∆b − (∆A) θ
9       for i from 1 to m do
10          j ← choose an index of µ using a dimension selection mechanism
11          θ_j ← θ_j + α µ_j
12          µ ← µ − α µ_j A e_j
13      end for
14      s ← s'
15  end repeat
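A runnable rendering of the pseudocode above in Python with dense numpy arrays (the talk assumes a sparse representation for efficiency). The env_step callback is a hypothetical stand-in for acting under the fixed policy π; the line-number comments refer to the pseudocode:

    import numpy as np

    def nonzero_random(mu, rng):
        """Non-zero random selection: uniform among dimensions where mu != 0."""
        nz = np.flatnonzero(mu)
        return int(rng.choice(nz)) if nz.size else int(rng.integers(mu.size))

    def ilstd_episode(env_step, phi0, theta, A, mu, alpha, gamma, m,
                      select=nonzero_random, rng=None):
        """One episode of iLSTD; theta, A, and mu are updated in place.
        env_step(phi) -> (reward, phi_next, done) is assumed, not from the talk."""
        rng = rng or np.random.default_rng()
        phi, done = phi0, False
        while not done:
            reward, phi_next, done = env_step(phi)           # line 3
            delta_b = reward * phi                           # line 5
            delta_A = np.outer(phi, phi - gamma * phi_next)  # line 6
            A += delta_A                                     # line 7
            mu += delta_b - delta_A @ theta                  # line 8
            for _ in range(m):                               # lines 9-13
                j = select(mu, rng)                          # line 10
                step = alpha * mu[j]
                theta[j] += step                             # line 11
                mu -= step * A[:, j]                         # line 12: mu -= alpha mu_j A e_j
            phi = phi_next                                   # line 14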

iLSTD Properties

Per-time-step computational complexity: $O(mn + k^2)$, i.e. linear in $n$, where $m$ is the number of descent iterations per time step, $n$ is the number of features, and $k$ is the maximum number of active features.

More data efficient than TD.

iLSTD Theorem: iLSTD converges with probability one to the same solution as TD, under the usual step-size conditions, for any dimension selection method that, in the limit, selects every dimension with non-zero $\boldsymbol{\mu}_t$ infinitely often.

Empirical Results

Settings

- Averaged over 30 runs
- Same random seed for all methods
- Sparse matrix representation
- iLSTD: non-zero random selection, one descent per iteration

Boyan Chain [Boyan 99]

[Figure: an N-state chain; most transitions yield reward -3, the final ones -2 and 0. Feature vectors interpolate linearly between unit vectors, e.g. (1, 0.75, 0.5, 0.25, 0, ...).]

Problem sizes: n = 4 (small), n = 25 (medium), n = 100 (large); k = 2.

[Figure: Small Boyan Chain. RMS error of V(s) over all states (log scale) vs. episodes and vs. wall-clock time, for TD, iLSTD, and LSTD.]

[Figure: Medium Boyan Chain. Same axes and methods as above.]

[Figure: Large Boyan Chain. Same axes and methods as above.]

Mountain Car [for details see RL-Library]

Tile coding: n = 10,000, k = 10.
Position = -1 (easy); position = -0.5 (hard).

[Figure: Easy Mountain Car. Loss (log scale) vs. episodes and vs. wall-clock time for TD, iLSTD, and LSTD, where $\text{Loss} = \|\mathbf{b}^* - \mathbf{A}^*\boldsymbol{\theta}\|_2^2$.]

[Figure: Hard Mountain Car. Loss (log scale) vs. episodes and vs. wall-clock time for TD, iLSTD, and LSTD.]

Running Time

[Figure: per-step running time (ms) of TD, iLSTD, and LSTD on the Boyan Chain and Mountain Car problems. TD's cost stays roughly constant, iLSTD's grows linearly, and LSTD's is by far the largest.]

Outline: next up, Dimension Selection.

Dimension Selection

- Random ✓
- Greedy
- ε-Greedy
- Boltzmann

Greedy Dimension Selection

Pick the dimension $i$ with the highest value of $|\mu_t(i)|$. Not proven to converge.

ε-Greedy Dimension Selection

With probability ε: non-zero random; with probability (1 - ε): greedy. The convergence proof applies.

Boltzmann Component Selection

Boltzmann distribution + non-zero random: a Boltzmann distribution applied over a pool of ψ × m dimensions. The convergence proof applies.
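These rules can be sketched as drop-in replacements for the select argument in the iLSTD sketch above. How the talk combines the Boltzmann distribution with the ψ × m pool is not fully recoverable from this transcript, so the Boltzmann variant below simply weights all non-zero dimensions (an assumption):

    import numpy as np

    def greedy(mu, rng):
        """Pick the dimension with the largest |mu_i| (not proven to converge)."""
        return int(np.argmax(np.abs(mu)))

    def epsilon_greedy(mu, rng, eps=0.1):
        """With probability eps: non-zero random; otherwise: greedy."""
        nz = np.flatnonzero(mu)
        if rng.random() < eps and nz.size:
            return int(rng.choice(nz))
        return greedy(mu, rng)

    def boltzmann(mu, rng, tau=1.0):
        """Boltzmann distribution over |mu| with temperature tau,
        restricted to non-zero dimensions (pool handling assumed)."""
        nz = np.flatnonzero(mu)
        if nz.size == 0:
            return int(rng.integers(mu.size))
        prefs = np.abs(mu[nz]) / tau
        prefs -= prefs.max()          # numerical stability before exp
        p = np.exp(prefs)
        p /= p.sum()
        return int(rng.choice(nz, p=p))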

Empirical Results

ε-Greedy: ε = 0.1; Boltzmann: ψ = 10^-9, τ = 1.

[Figure: error measure (log scale) for Random, Greedy, ε-Greedy, and Boltzmann selection on the Boyan Chain (small, medium, large) and Mountain Car (easy, hard) problems.]

Running Time

[Figure: clock time per step (ms) of Random, Greedy, Boltzmann, and ε-greedy selection on the Boyan Chain (small, medium, large) and Mountain Car (easy, hard) problems.]

Outline: next up, Conclusion.

Conclusion

LSTD: data efficient, and no learning rate! TD: fast. iLSTD sits between the two on the data-efficiency vs. speed trade-off.

Questions ...

What if someone uses batch LSTD?
Why does iLSTD take simple descent steps?
Hmm ... what about control?

Thanks ...
