doulin4844 2016-05-30 11:24
Viewed 66 times
Accepted

Q-learning values get too high

I've recently made an attempt to implement a basic Q-Learning algorithm in Golang. Note that I'm new to Reinforcement Learning and AI in general, so the error may very well be mine.

Here's how I implemented a solution for an m,n,k-game environment: at each time step t, the agent holds the last state-action pair (s, a) and the reward it received for it; it selects a move a' under an epsilon-greedy policy, computes the reward r, and then updates the value of Q(s, a) for time t-1:

func (agent *RLAgent) learn(reward float64) {
    // Look up the stored value for the previous state-action.
    var mState = marshallState(agent.prevState, agent.id)
    var oldVal = agent.values[mState]

    // Update: Q(s,a) += α·(prevScore + γ·reward − Q(s,a))
    agent.values[mState] = oldVal + (agent.LearningRate *
        (agent.prevScore + (agent.DiscountFactor * reward) - oldVal))
}

Note:

  • agent.prevState holds the previous state right after taking the action and before the environment responds (i.e. after the agent makes its move and before the other player makes a move). I use it in place of the state-action tuple, but I'm not quite sure that's the right approach
  • agent.prevScore holds the reward for the previous state-action
  • The reward argument represents the reward for current step's state-action (Qmax)

With agent.LearningRate = 0.2 and agent.DiscountFactor = 0.8, the agent fails to reach 100K episodes because the state-action values overflow. I'm using Go's float64 (a standard IEEE 754 double-precision floating-point type), which overflows at around ±1.80×10^308 and yields ±Inf. That's too big a value, I'd say!
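The blow-up is easy to reproduce in isolation: if the "reward" term fed into the update is actually a bootstrapped Q-value of similar magnitude to the state's own value, each visit multiplies the value by a factor greater than 1. A toy sketch with the α and γ from above (illustrative only, not the full program):

```go
package main

import "fmt"

func main() {
	const alpha, gamma = 0.2, 0.8
	q := 0.5 // value of a frequently revisited state

	// If both the stored "prevScore" and the incoming "reward" are
	// Q-values of magnitude ~q, the update collapses to
	// q += alpha*gamma*q, i.e. q *= 1.16 per visit: exponential growth.
	for i := 0; i < 200; i++ {
		q += alpha * (q + gamma*q - q)
	}
	fmt.Printf("%.2e\n", q) // already past 1e12 after only 200 visits
}
```

Over 100K episodes of self-play, a frequently visited state sees far more than 200 updates, which is enough to push float64 past ±1.80×10^308.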

Here's the state of a model trained with a learning rate of 0.02 and a discount factor of 0.08, which got through 2M episodes (1M games against itself):

Reinforcement learning model report
Iterations: 2000000
Learned states: 4973
Maximum value: 88781786878142287058992045692178302709335321375413536179603017129368394119653322992958428880260210391115335655910912645569618040471973513955473468092393367618971462560382976.000000
Minimum value: 0.000000

The reward function returns:

  • Agent won: 1
  • Agent lost: -1
  • Draw: 0
  • Game continues: 0.5

But you can see that the minimum value is zero, and the maximum value is too high.

It may be worth mentioning that a simpler learning method I found in a Python script works perfectly fine and actually feels more intelligent! When I play against it, the result is a draw most of the time (it even wins if I play carelessly), whereas with the standard Q-learning method I can't even let it win!

agent.values[mState] = oldVal + (agent.LearningRate * (reward - agent.prevScore))

Any ideas on how to fix this? Is that kind of state-action value normal in Q-Learning?!


Update: After reading Pablo's answer and the slight but important edit Nick made to this question, I realized the problem was that prevScore contained the Q-value of the previous step (equal to oldVal) instead of the reward of the previous step (in this example: -1, 0, 0.5, or 1).
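The corrected update can be sketched as a self-contained snippet (the struct is pared down and marshallState is replaced by a plain string key for illustration; the point is that prevScore now stores the raw previous reward, not a Q-value):

```go
package main

import "fmt"

// Minimal agent sketch; field names mirror the question's snippet.
type RLAgent struct {
	values         map[string]float64
	prevState      string // stands in for marshallState(prevState, id)
	prevScore      float64
	LearningRate   float64
	DiscountFactor float64
}

// learn applies the standard Q-learning update to the previous state-action:
// Q(s,a) ← Q(s,a) + α·(r + γ·qMax − Q(s,a))
func (agent *RLAgent) learn(qMax float64) {
	oldVal := agent.values[agent.prevState]
	agent.values[agent.prevState] = oldVal + agent.LearningRate*
		(agent.prevScore+agent.DiscountFactor*qMax-oldVal)
}

func main() {
	agent := &RLAgent{
		values:         map[string]float64{"s0": 0.4},
		prevState:      "s0",
		prevScore:      0.5, // raw reward for the previous move, not a Q-value
		LearningRate:   0.2,
		DiscountFactor: 0.8,
	}
	agent.learn(1.0) // max Q-value over actions in the current state
	fmt.Printf("%.3f\n", agent.values["s0"]) // 0.4 + 0.2·(0.5 + 0.8·1.0 − 0.4) = 0.580
}
```

Because the raw rewards are bounded in [-1, 1], the Q-values now stay bounded as well, consistent with the report below.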

After that change, the agent now behaves normally and after 2M episodes, the state of the model is as follows:

Reinforcement learning model report
Iterations: 2000000
Learned states: 5477
Maximum value: 1.090465
Minimum value: -0.554718

and out of 5 games against the agent, I won 2 (the agent did not recognize that I had two stones in a row) and 3 were draws.

2 answers

  • dongshukou0240 2016-06-01 07:07

    If I've understood correctly, your Q-learning update rule uses both the current reward and the previous reward. However, the Q-learning rule only uses one reward (here x are states and u are actions):

        Q(x, u) ← Q(x, u) + α · [r + γ · max_{u'} Q(x', u') − Q(x, u)]

    On the other hand, you are assuming that the current reward is the same as the Qmax value, which is not true. So you are probably misunderstanding the Q-learning algorithm.
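    In Go terms, the textbook rule takes a single immediate reward r plus the discounted best next value, which are two different quantities (a hedged sketch; the function name and parameters are illustrative, not from the asker's code):

```go
package main

import "fmt"

// qUpdate applies the textbook rule:
// Q(x,u) ← Q(x,u) + α·[r + γ·max_{u'} Q(x',u') − Q(x,u)]
func qUpdate(q, r, maxNext, alpha, gamma float64) float64 {
	return q + alpha*(r+gamma*maxNext-q)
}

func main() {
	// One update for a winning terminal move: r = 1, no successor value.
	fmt.Printf("%.2f\n", qUpdate(0.0, 1.0, 0.0, 0.2, 0.8)) // 0.20
}
```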

