Published on March 16, 2014
Temporal-difference search in computer Go David Silver · Richard S. Sutton · Martin Müller March 3, 2014 1 / 24
Three Major Sections
Section 3: Shape features (almost an exact duplicate of a previous paper from 2004)
Section 2: TD search (simulation-based search methods for game play)
Section 4: Dyna-2 algorithm (introducing the concept of long-term and short-term memory in simulated search)
2 / 24
AI Game Playing
In AI, the most successful game-playing strategies have followed these steps:
1st: Positions are evaluated by a linear combination of many features. The position is broken down into small local components.
2nd: Weights are trained by TD learning and self-play.
Finally: The linear evaluation function is combined with a suitable search algorithm to produce a high-performing game-playing program.
3 / 24
Sec 3.0: TD Learning with Local Shape Features
Approach: Shapes are an important part of Go. Master-level players think in terms of general shapes.
Objective: win games of Go. Rewards: r = 1 if Black wins and r = 0 if White wins.
The state s assigns one of three values to each intersection of the board: empty, white, or black.
Local shapes $l_i$ are all configurations of stones within square regions up to 3x3.
$\phi_i(s) = 1$ if the board matches local shape $l_i$, and 0 otherwise.
The value function $V^{\pi}(s)$ is the expected total reward from state s when following policy π (the probability of winning).
The value function is approximated by a logistic-linear combination of shape features, with location-independent (LI) and location-dependent (LD) weights: $V(s) = \sigma(\phi(s) \cdot \theta^{LI} + \phi(s) \cdot \theta^{LD})$, which is Black's winning probability from state s.
This is model-free learning, since the value function is learned directly.
Two-ply update: the TD(0) error is calculated after both the player and the opponent have made a move, using $V(s_{t+2})$.
Self-play: agent and opponent use the same policy π(s, a).
4 / 24
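As a rough illustration of the logistic-linear value function on this slide, the sketch below evaluates a position from its active binary shape features. It is a simplification under stated assumptions: it uses a single weight vector rather than the two weight sets in the slide's formula, and the names `value`, `active_features` and `theta` are hypothetical, not taken from the paper or from RLGO.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def value(active_features, theta):
    """Estimate Black's winning probability for one position.

    active_features: indices i of the binary shape features with phi_i(s) = 1.
    theta: list of weights, one per feature (a single weight set, by assumption).
    """
    return sigmoid(sum(theta[i] for i in active_features))


theta = [0.0] * 8                 # all weights start at zero
v_before = value([1, 4, 6], theta)  # zero weights give an even game
theta[4] = 2.0                    # pretend training found feature 4 good for Black
v_after = value([1, 4, 6], theta)
```

Because the features are binary, the dot product reduces to a sum over the weights of the shapes that match the board.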
Sec 3.2.1: Training Procedure
Objective: train the weights of the value function V(s), updating them with logistic TD(0).
Initialize all weights to zero and run a million games of self-play to find the weights of V(s).
Black and White select moves using an ε-greedy policy over the same value function.
Self-play: agent and opponent use the same ε-greedy policy π(s, a).
A game terminates when both players pass.
7 / 24
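A minimal sketch of ε-greedy move selection over a shared value function might look like the following. It assumes that V estimates Black's winning probability, so Black picks the successor with the highest value and White the one with the lowest; the function name `epsilon_greedy` and its arguments are hypothetical.

```python
import random


def epsilon_greedy(successor_values, black_to_move, epsilon, rng=random):
    """Return the index of the chosen move.

    successor_values: V(s') for each legal successor position s'.
    black_to_move: True if Black is choosing (maximizes V), else White (minimizes V).
    epsilon: probability of playing a uniformly random move instead.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(successor_values))  # explore
    pick = max if black_to_move else min
    return successor_values.index(pick(successor_values))  # exploit


values = [0.2, 0.7, 0.4]
black_choice = epsilon_greedy(values, True, 0.0)
white_choice = epsilon_greedy(values, False, 0.0)
random_choice = epsilon_greedy(values, True, 1.0, random.Random(0))
```

With epsilon set to 0 the choice is purely greedy; with epsilon set to 1 every move is random.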
Results with different sets of local shape features 8 / 24
Results with different sets of local shape features cont’d 9 / 24
Alpha Beta Search
AB search: during game play, this technique searches the potential moves from state s to successor states s'. (Note: we are now using the learnt value function V(s).)
For example, at depth 2, if it is White's move we consider all of White's moves and Black's responses to each of those moves.
We maintain an alpha and a beta as the lower and upper bounds on the value of the move.
10 / 24
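The pruning idea can be sketched on a toy depth-2 tree. This is a generic textbook alpha-beta search, not the authors' implementation; the tree, the leaf values and the helper names `children` and `evaluate` are made up for illustration, with `evaluate` standing in for the learnt V(s) (Black maximizes, White minimizes).

```python
def alphabeta(node, depth, alpha, beta, black_to_move, children, evaluate):
    """Return the minimax value of node, pruning branches outside (alpha, beta)."""
    kids = children(node)
    if depth == 0 or not kids:
        return evaluate(node)
    if black_to_move:  # maximizing player
        best = float("-inf")
        for child in kids:
            best = max(best, alphabeta(child, depth - 1, alpha, beta,
                                       False, children, evaluate))
            alpha = max(alpha, best)  # raise the lower bound
            if alpha >= beta:
                break  # the opponent will never allow this line
        return best
    best = float("inf")  # minimizing player
    for child in kids:
        best = min(best, alphabeta(child, depth - 1, alpha, beta,
                                   True, children, evaluate))
        beta = min(beta, best)  # lower the upper bound
        if alpha >= beta:
            break
    return best


# Hypothetical toy tree: Black to move at the root, White replies.
tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
leaf_values = {"a1": 0.6, "a2": 0.3, "b1": 0.2, "b2": 0.9}
visited = []


def children(node):
    return tree.get(node, [])


def evaluate(node):
    visited.append(node)  # record which leaves were actually evaluated
    return leaf_values[node]


root_value = alphabeta("root", 2, float("-inf"), float("inf"),
                       True, children, evaluate)
```

In this toy tree, leaf `b2` is never evaluated: once White's reply `b1` is worth 0.2, move `b` cannot beat the 0.3 already guaranteed by move `a`.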
AB Search Results 11 / 24
Section 4.0: Temporal-difference search
Idea: from the current state $s_t$ there is always a subgame $G'$ of the original game $G$. Apply TD learning to $G'$, using games of self-play that start from the current state $s_t$.
Simulation-based search: the agent samples episodes of simulated experience and updates its value function from that experience.
Begin in state $s_0$. At each step µ of simulation, an action $a_\mu$ is selected according to a simulation policy, and a new state $s_{\mu+1}$ and reward $r_{\mu+1}$ are generated by the MDP model. Repeat until a terminal state is reached.
12 / 24
MCTS and temporal-difference search cont’d
TD search uses value function approximation on the current subgame (our V(s) from before). It can update all similar states at once, since the value function approximates the value of any position s ∈ S.
MCTS must wait many time steps for the final outcome of a simulation, and that outcome depends on all of the agent's decisions throughout the simulation.
TD search can bootstrap, as before, between subsequent states. It does not need to wait until the end of a simulation to correct for the TD error (just as in TD learning).
MCTS is currently the best-known example of simulation-based search.
13 / 24
Linear TD Search
Linear TD search is an implementation of TD search in which the value function is updated online.
The agent simulates episodes of experience from the current state by sampling from its current policy π(s, a), the transition model $P^{\pi}_{ss'}$ and the reward model $R^{\pi}_{ss'}$ (note: P gives the transition probabilities and R is the reward function).
Linear TD search is applied to the subgame at the current state. Instead of using a search tree, the agent approximates the value function by a linear combination: $Q_\mu(s, a) = \phi(s, a) \cdot \theta_\mu$
Q is the action value function, $Q^{\pi}(s, a) = E[R_t \mid s_t = s, a_t = a]$; θ is the weight vector, µ is the current time step, and φ(s, a) is the feature vector representing states and actions.
After each step the agent updates the parameters by TD learning, using TD(λ).
14 / 24
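To make the per-step update concrete, here is a minimal sketch of one linear TD(λ) step with accumulating eligibility traces. It is a textbook form under stated assumptions (undiscounted returns, a value of zero after the terminal state); the function name `td_lambda_step` and the two-step toy episode are hypothetical and not taken from the paper.

```python
def dot(x, y):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(x, y))


def td_lambda_step(theta, trace, phi, phi_next, reward, alpha, lam, terminal):
    """One undiscounted linear TD(lambda) update of Q(s, a) = phi(s, a) . theta.

    Returns the new weights, the new eligibility trace, and the TD error.
    """
    q = dot(phi, theta)
    q_next = 0.0 if terminal else dot(phi_next, theta)  # bootstrap from next step
    delta = reward + q_next - q                         # TD error
    new_trace = [lam * e + f for e, f in zip(trace, phi)]
    new_theta = [w + alpha * delta * e for w, e in zip(theta, new_trace)]
    return new_theta, new_trace, delta


# Toy simulated episode with two one-hot state-action features and a win (r = 1).
theta, trace = [0.0, 0.0], [0.0, 0.0]
theta, trace, d1 = td_lambda_step(theta, trace, [1.0, 0.0], [0.0, 1.0],
                                  0.0, alpha=0.1, lam=0.5, terminal=False)
theta, trace, d2 = td_lambda_step(theta, trace, [0.0, 1.0], None,
                                  1.0, alpha=0.1, lam=0.5, terminal=True)
```

The trace lets the final reward also adjust the weight of the earlier state-action pair, scaled down by λ, without waiting for a separate pass over the episode.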
TD Search Results 15 / 24
TD search in computer Go
In section 3.0 the value of each shape was learned offline, using TD learning by self-play. This is considered myopic, since each local shape is evaluated without knowledge of the entire board.
The idea is to use local shape features in TD search. TD search can learn the value of each feature in the context of the current board state, as discussed previously. This allows the agent to focus on what works well now.
Issue: starting simulations from the current position breaks the symmetries of the board, so the weight sharing (feature reduction) that relied on those symmetries is lost.
16 / 24
Change to TD search
Remember that with shapes in section 3.0 we were learning the value function V(s). We modify the TD search algorithm to update the weights of that value function.
Linear TD search is applied to the subgame at the current state. Instead of using a search tree, the agent approximates the value function by a linear combination of features, with weights updated by: $\Delta\theta = \alpha \frac{\phi(s_t)}{\|\phi(s_t)\|^2} \left( V(s_{t+2}) - V(s_t) \right)$
17 / 24
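A hedged sketch of a normalized two-ply update of this kind is shown below. It assumes binary features, so the squared norm of the feature vector is simply the number of active features, and it assumes that the game outcome replaces the bootstrap value once the game has ended; the names `two_ply_update`, `active_t` and `active_t2` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def value(active, theta):
    """Logistic-linear value: sigmoid of the summed weights of active features."""
    return sigmoid(sum(theta[i] for i in active))


def two_ply_update(theta, active_t, active_t2, alpha, outcome=None):
    """Update theta in place from s_t towards the value two plies later.

    active_t / active_t2: indices of binary features active at s_t and s_{t+2}.
    outcome: final reward, used as the target when the game has ended (assumption).
    """
    if not active_t:
        return 0.0  # no active features, nothing to update
    target = outcome if outcome is not None else value(active_t2, theta)
    error = target - value(active_t, theta)
    step = alpha * error / len(active_t)  # divide by ||phi||^2 for binary phi
    for i in active_t:
        theta[i] += step
    return error


theta = [0.0] * 5
err = two_ply_update(theta, [0, 2], [], alpha=0.1, outcome=1.0)
```

Dividing by the number of active features keeps the total change in the linear sum equal to α times the error, however many shapes match the position.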
Experiments with TD search
Ran tournaments of at least 200 games between different versions of RLGO.
Used the bayeselo program to calculate Elo ratings.
Recall that TD search is a simulation-based search (slide 13) and, so far, has no handcrafted simulation policy for each step µ. A simulation policy selects an action from every state in the MDP.
Fuego 0.1 provides a “vanilla” default policy. It is used to incorporate some prior knowledge into the simulation policy, since TD search by itself assumes no prior knowledge. They switch to this policy every T moves. Switching policy every 2-8 moves resulted in a 300 point Elo improvement.
The results show the importance of what they call “temporality”: focusing the agent's resources on the current moment.
18 / 24
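To give a feel for what a 300 point Elo gap means, the sketch below uses the standard logistic Elo expected-score formula. This is the textbook model only; the exact model fitted by bayeselo may differ in its details, and the function name `expected_score` is hypothetical.

```python
def expected_score(rating_diff):
    """Expected score of the higher-rated side under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


score_300 = expected_score(300)  # about 0.85
score_even = expected_score(0)   # equal ratings give 0.5
```

Under this model, a 300 point advantage corresponds to an expected score of roughly 85% against the weaker version.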
TD Results 19 / 24
TD Results 20 / 24
Dyna-2: Integrating short and long-term memories
Learning algorithms slowly extract knowledge from the complete history of training data.
Search algorithms use and extend this knowledge, rapidly and online, so as to evaluate local states more accurately.
Dyna-2 combines both TD learning and TD search.
Sutton had a previous algorithm, Dyna (1990), that applied TD learning both to real experience and to simulated experience.
The key idea of Dyna-2 is to maintain two separate memories: a long-term memory that is learnt from real experience, and a short-term memory that is used during search and is updated from simulated experience.
21 / 24
Dyna-2 cont’d
Define a long-term value function $Q(s, a)$ and a short-term value function $\bar{Q}(s, a)$:
$Q(s, a) = \phi(s, a) \cdot \theta$
$\bar{Q}(s, a) = \phi(s, a) \cdot \theta + \bar{\phi}(s, a) \cdot \bar{\theta}$
Q is the action value function, $Q^{\pi}(s, a) = E[R_t \mid s_t = s, a_t = a]$; θ and $\bar{\theta}$ are the long-term and short-term weights, and φ(s, a) and $\bar{\phi}(s, a)$ are the feature vectors representing states and actions.
The short-term value function $\bar{Q}(s, a)$ uses both memories to approximate the true value function.
Two-phase search: AB search is performed after each TD search.
22 / 24
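The two-memory idea can be sketched as two weight vectors whose contributions are added. In this illustration `theta` plays the role of the long-term memory and `theta_bar` the short-term memory; the feature values, the weights and the choice to start `theta_bar` at zero are all assumptions made for the example.

```python
def dot(x, y):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(x, y))


def q_long(phi, theta):
    """Long-term value: uses the long-term memory only."""
    return dot(phi, theta)


def q_short(phi, theta, phi_bar, theta_bar):
    """Short-term value: long-term estimate plus a short-term correction."""
    return dot(phi, theta) + dot(phi_bar, theta_bar)


phi, phi_bar = [1.0, 1.0], [1.0, 0.0]  # hypothetical features for one (s, a)
theta = [0.4, -0.1]                    # long-term memory, learnt from real games
theta_bar = [0.0, 0.0]                 # short-term memory, empty before search

before = q_short(phi, theta, phi_bar, theta_bar)
theta_bar[0] = 0.2                     # pretend simulated search adjusted it
after = q_short(phi, theta, phi_bar, theta_bar)
```

Before any search the short-term value equals the long-term value; search then adds a local correction without overwriting what was learnt from real experience.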
Dyna-2 results 23 / 24
Discussion
What is the significance of the 2nd author?
What are your thoughts on the overall performance of this algorithm? Why didn’t they outperform modern MCTS methods?
Are there any other applications where this might be useful?
Did you think the paper did a good job explaining their approach? Was it descriptive enough?
What feature of Go, as compared to chess, checkers, or backgammon, makes it different in the reinforcement learning environment?
Is using only a 1x1 feature set of shapes equivalent to the notion of "over-fitting"?
What is the advantage of the two-ply update versus the 1-ply update that they referred to in section 3.2? What is the trade-off as we go up to 6 ply?
24 / 24