I've recently made an attempt to implement a basic Q-Learning algorithm in Golang. Note that I'm new to Reinforcement Learning and AI in general, so the error may very well be mine.
Here's how I implemented the solution to an m,n,k-game environment:
At each time step t, the agent holds the last state-action pair (s, a) and the reward acquired for it; it selects a move a' based on an epsilon-greedy policy, calculates the reward r, and then proceeds to update the value of Q(s, a) for time t-1.
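For reference, the one-step Q-Learning update I'm trying to implement is

Q(s, a) ← Q(s, a) + α * [r + γ * max_a' Q(s', a') − Q(s, a)]

where α is agent.LearningRate, γ is agent.DiscountFactor, r is the reward observed after taking a in s, and s' is the state that follows. Here is the learn function that performs the update: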
func (agent *RLAgent) learn(reward float64) {
	// Key the value table on the previous state, from this agent's perspective.
	var mState = marshallState(agent.prevState, agent.id)
	var oldVal = agent.values[mState]
	// Move the old estimate toward prevScore + DiscountFactor*reward.
	agent.values[mState] = oldVal + (agent.LearningRate *
		(agent.prevScore + (agent.DiscountFactor * reward) - oldVal))
}
Note:
- agent.prevState holds the previous state right after taking the action and before the environment responds (i.e. after the agent makes its move and before the other player makes a move). I use that in place of the state-action tuple, but I'm not quite sure if that's the right approach.
- agent.prevScore holds the reward for the previous state-action.
- The reward argument represents the reward for the current step's state-action (Qmax).
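The epsilon-greedy selection isn't shown above; conceptually it is just the following. This is a simplified, generic sketch rather than my actual game code: the candidates would be the marshalled states reachable from the current position, and the epsilon parameter is a placeholder for however the exploration rate is configured.

import (
	"math"
	"math/rand"
)

// chooseEpsilonGreedy returns a random candidate with probability epsilon
// (exploration); otherwise it returns the candidate with the highest value
// in the table (exploitation). Assumes candidates is non-empty; unseen
// candidates default to a value of 0.
func chooseEpsilonGreedy(values map[string]float64, candidates []string, epsilon float64) string {
	if rand.Float64() < epsilon {
		return candidates[rand.Intn(len(candidates))]
	}
	best, bestVal := candidates[0], math.Inf(-1)
	for _, c := range candidates {
		if v := values[c]; v > bestVal {
			best, bestVal = c, v
		}
	}
	return best
}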
With agent.LearningRate = 0.2 and agent.DiscountFactor = 0.8, the agent fails to reach 100K episodes because of state-action value overflow. I'm using Go's float64 (standard IEEE 754 double precision), which overflows at around ±1.80×10^308 and yields ±Inf. That's too big a value, I'd say!
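Just to demonstrate the overflow behaviour in isolation (this is only a float64 demo, unrelated to the game code):

package main

import (
	"fmt"
	"math"
)

func main() {
	v := math.MaxFloat64              // ≈ 1.80×10^308, the largest finite float64
	v *= 2                            // anything beyond that range overflows
	fmt.Println(v, math.IsInf(v, +1)) // prints: +Inf true
}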
Here's the state of a model trained with a learning rate of 0.02 and a discount factor of 0.08, which got through 2M episodes (1M games against itself):
Reinforcement learning model report
Iterations: 2000000
Learned states: 4973
Maximum value: 88781786878142287058992045692178302709335321375413536179603017129368394119653322992958428880260210391115335655910912645569618040471973513955473468092393367618971462560382976.000000
Minimum value: 0.000000
The reward function returns:
- Agent won: 1
- Agent lost: -1
- Draw: 0
- Game continues: 0.5
But you can see that the minimum value is zero, and the maximum value is too high.
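The reward function itself is nothing more than a mapping from the game outcome to those four constants. Here is a sketch of its shape, with a placeholder GameResult type and constant names standing in for the environment's actual result values:

type GameResult int

const (
	ResultWin GameResult = iota
	ResultLoss
	ResultDraw
	ResultOngoing
)

// reward maps a game outcome to the values listed above.
func reward(r GameResult) float64 {
	switch r {
	case ResultWin:
		return 1
	case ResultLoss:
		return -1
	case ResultDraw:
		return 0
	default: // game continues
		return 0.5
	}
}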
It may be worth mentioning that a simpler learning method I found in a Python script works perfectly fine and actually feels more intelligent! When I play against it, the result is a draw most of the time (it even wins if I play carelessly), whereas with the standard Q-Learning method, I can't even let it win! Its update is simply:
agent.values[mState] = oldVal + (agent.LearningRate * (reward - agent.prevScore))
Any ideas on how to fix this? Is that kind of state-action value normal in Q-Learning?!
Update:
After reading Pablo's answer and the slight but important edit that Nick provided to this question, I realized the problem was prevScore containing the Q-value of the previous step (equal to oldVal) instead of the reward of the previous step (in this example: -1, 0, 0.5 or 1).
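In code, the fix is purely in the bookkeeping between steps; the buggy line below is a reconstruction of the effect rather than a literal quote of my code:

// The bug, in effect: carrying the freshly updated Q-value forward
//     agent.prevScore = agent.values[mState]
// The fix: carrying forward the raw reward for the move just made
agent.prevScore = r // r is the step reward: -1, 0, 0.5 or 1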
After that change, the agent now behaves normally and after 2M episodes, the state of the model is as follows:
Reinforcement learning model report
Iterations: 2000000
Learned states: 5477
Maximum value: 1.090465
Minimum value: -0.554718
and out of 5 games with the agent, there were 2 wins for me (the agent did not recognize that I had two stones in a row) and 3 draws.