2021-01-11 06:25

# [Research/suggestion] Next gen algorithm idea

Hi, as the title says, this issue is to keep track of an idea/suggestion for a future generation of the algorithm. I know a lot of new ideas get suggested, so I try to keep this kind of post to a minimum.

The idea is based on the fact that the evaluation network and policy network do not benefit from the result of the tree search already performed. As a human, if I spend a lot of time finding a local sequence of moves that works for a specific game position, I store it in my memory, and that memory of my finding influences my (human) policy and evaluation for the rest of the game. For example, I may judge that it is not yet the right time to actually kill something, but I have already found exactly how to do it; or I may have spent a long time finding the refutation of an attempt to exploit one of my weaknesses, and kept it in mind. When the moment comes later in the game, I have better instincts and a better evaluation of positions where the sequence can be used (I will basically try it at first glance). Even if my conclusion was not my first guess (policy) and not in line with my evaluation, I now know something additional, and it points my reading toward the same kind of sequence even after other stones have been added to the board, without first having to exhaust the same bad instinct on some local aspect, move after move, node after node.

It seems quite possible to do this kind of thing with the algorithm, and there seem to be plenty of good and bad ways to do it. Hardcoding the format of what is stored seems the easiest. For example, keep a sliding average of move visits and their evaluations on the same NxN matrix: something saying that in the recent tree search a sequence involving, say, C3 has occurred X times and has been evaluated on average to V, for each move (better schemes exist; this is just a proof of concept). The idea is then to inject this hardcoded, aggregated memory of the recent tree search as input planes of both nets.
We can ask whether such a thing would converge, but as long as we hardcode a memory format like this, I would say yes, because the network can choose to totally ignore the data if training does not show it to be relevant. If this works, the network could use it to compute a better first guess of moves and their evaluations, mixing into its computation of policy and eval the result of this aggregation (itself produced from a large tree search and a huge number of network results, specific to the game context). Have you come across such an idea? It would be a kind of migration to a stateful use of the NN.
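As a proof-of-concept sketch of the hardcoded variant above, assuming a 19x19 board (all names here, `SearchMemory`, `update`, `as_planes`, are hypothetical illustrations, not code from any existing engine):

```python
import numpy as np

BOARD = 19

class SearchMemory:
    """Keeps an exponential sliding average, per intersection, of:
    - how often recent tree searches visited a move there, and
    - the average value those visits reported."""
    def __init__(self, decay=0.9):
        self.decay = decay
        self.visit_avg = np.zeros((BOARD, BOARD))
        self.value_avg = np.zeros((BOARD, BOARD))

    def update(self, visits, values):
        """visits/values: (19, 19) arrays aggregated over the last tree search."""
        self.visit_avg = self.decay * self.visit_avg + (1 - self.decay) * visits
        self.value_avg = self.decay * self.value_avg + (1 - self.decay) * values

    def as_planes(self):
        """Two extra input planes to stack onto the network's feature tensor."""
        scale = self.visit_avg.max() or 1.0  # avoid dividing by zero early on
        return np.stack([self.visit_avg / scale, self.value_avg])

# Usage: after each search, aggregate per-move statistics and feed the
# resulting planes alongside the usual board-history input planes.
mem = SearchMemory()
visits = np.zeros((BOARD, BOARD)); visits[2, 2] = 120.0  # e.g. C3 visited often
values = np.zeros((BOARD, BOARD)); values[2, 2] = 0.62
mem.update(visits, values)
planes = mem.as_planes()
print(planes.shape)  # (2, 19, 19)
```

The decay constant here is arbitrary; the point is only that the memory is a fixed-format tensor the network can learn to use, or to ignore.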

Chuck Norris mode: do not hardcode the memory format, and instead drive the storage of tree-search results with another NN (far more free and flexible, but far harder to design...).

Regards.

#### 8 replies

• I have been thinking about something similar. One possible way is to feed the MCTS result from the previous move as input to the NN when computing the next move.
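A minimal sketch of that suggestion, assuming an AlphaGo-style feature stack (the function name and the 18-channel stack are illustrative assumptions, not an existing API):

```python
import numpy as np

def with_prev_search_plane(board_planes, prev_root_visits):
    """Append the previous search's root visit distribution as one extra plane.

    board_planes: (C, N, N) usual input features.
    prev_root_visits: (N, N) raw visit counts at the root of the previous
    search (all zeros on the first move of the game)."""
    total = prev_root_visits.sum()
    plane = prev_root_visits / total if total > 0 else prev_root_visits
    return np.concatenate([board_planes, plane[None]], axis=0)

features = np.zeros((18, 19, 19))  # e.g. an AlphaGo-style input stack
prev = np.zeros((19, 19)); prev[3, 3] = 800; prev[15, 3] = 200
stacked = with_prev_search_plane(features, prev)
print(stacked.shape)  # (19, 19, 19)
```

This only carries the root statistics of one search forward, so it is a narrower form of memory than the sliding-average planes in the opening post.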

• Very interesting idea. The search really does have a lot of problems right now...

• In fact that would be quick to do, because there are already history planes, and we could add a history of policy/eval conclusions to them. BUT it would miss a big part of the useful conclusions: if you find a good move in, say, the left part of the tree search, what you want from such a memory system is to be able to influence your policy and eval even in the right part of the tree search. You want to be able to influence the NN's conclusions at some nodes using its conclusions at other nodes that are neither parents nor children of them (even indirectly).

• Time to think? :)

• The suggestion looks similar to what is done with the all-moves-as-first heuristic and Rapid Action Value Estimation (RAVE); see Section 4 of the paper by Gelly and Silver (2011) [1]. The AlphaGo paper in Nature (2016) [2] mentions the techniques as follows (Methods section): "AlphaGo does not employ the all-moves-as-first or rapid action value estimation heuristics used in the majority of Monte Carlo Go programs; when using policy networks as prior knowledge, these biased heuristics do not appear to give any additional benefit."

(This is not to dismiss the suggestion to try it again; just to note that some variant of it has been tried already.)

[1] Gelly, Sylvain, and David Silver. "Monte-Carlo tree search and rapid action value estimation in computer Go." Artificial Intelligence 175, no. 11 (2011): 1856-1875. Online: https://doi.org/10.1016/j.artint.2011.03.007.

[2] Silver, David, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529, no. 7587 (2016): 484-489. Online: https://doi.org/10.1038/nature16961.
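For concreteness, the MC-RAVE blend described in [1] can be sketched as follows (a minimal illustration, not code from any engine; the AMAF statistics count a move wherever it occurs later in a playout, not only as the immediate action, and `b` is the RAVE bias parameter from the paper):

```python
def rave_value(mc_q, mc_n, amaf_q, amaf_n, b=0.05):
    """Blend the Monte-Carlo value with the AMAF value.

    beta starts near 1 (trust the cheap AMAF estimate) and decays toward 0
    as the real visit count mc_n grows, per Gelly & Silver (2011)."""
    if mc_n == 0 and amaf_n == 0:
        return 0.0
    beta = amaf_n / (mc_n + amaf_n + 4 * b * b * mc_n * amaf_n)
    return (1 - beta) * mc_q + beta * amaf_q

# With few real visits, the AMAF estimate dominates the blended value:
print(rave_value(mc_q=0.4, mc_n=2, amaf_q=0.7, amaf_n=50))
```

The contrast with the memory-plane idea is that RAVE biases node statistics directly with a fixed formula, whereas the suggestion above would let the network learn how to use the aggregated search statistics.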

• In fact I find the DeepMind sentence quite unclear, because we don't know whether they used the "other" playout results as additional input to the NN (what I suggest above) or used them to tweak the NN output by hand (manually biasing the trained heuristic). If they mean the second, it is of course not very elegant and might be more destructive than constructive; the first, where the NN itself learns how and when to use this memory, seems far more flexible and free. Can you tell which of the two they are talking about?

• I don't want to guess what the DeepMind folks explored before rejecting it. They provide two citations: the Gelly and Silver (2011) work already noted, and a 2003 ACG conference paper by Bouzy and Helmstetter [1].

[1] Bouzy, Bruno, and Bernard Helmstetter. "Monte-Carlo Go developments." In Advances in computer games, pp. 159-174. Springer, Boston, MA, 2004. Online: https://doi.org/10.1007/978-0-387-35706-5_11.

• Closing older issues; no discussion for >6 months.
