Hello everyone,
I am sure that many of you have known this algorithm. As far as I am concerned, it is one of the most popular techniques in AI. I am learning it and I have a lot of questions regarding the intuition behind it, and hence I am writing today, with hope that some of you can share your experience with me.
To summarize a bit, a MCTS algorithm is consisted of 4 phases — Select, Expand, Simulate and Propagate, and is based on a formula called UCB1 to balance between exploitation and exploration. My first concern is about the UCB1 formula, that is UCB1 = (total wins / total visits) + C * sqrt(ln(total visits in parent node) / total visits in current node). My question is what will change if I, instead of using "total number of wins", use something like "total reward" (which may be HP loss, energy increase and so on)? In such case, will the term sqrt(ln(total visits in parent node) / total visits in current node) stay the same?
Thank you for your time and consideration. I am looking forward to hearing from you.