Reinforcement learning: why does Gym's reset() return an array of values in a small range rather than all zeros when it resets the environment?

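I'm learning reinforcement learning. Why does reset() in Gym return a non-zero array? As I understand it, doesn't resetting mean going back to zero? Take the
CartPole-v0 environment as an example: why does def reset() return 4 random numbers between -0.05 and 0.05?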
def reset(
    self,
    *,
    seed: Optional[int] = None,
    return_info: bool = False,
    options: Optional[dict] = None,
):
    super().reset(seed=seed)
    # The two lines below are the ones being asked about.
    self.state = self.np_random.uniform(low=-0.05, high=0.05, size=(4,))
    self.steps_beyond_done = None
    if not return_info:
        return np.array(self.state, dtype=np.float32)
    else:
        return np.array(self.state, dtype=np.float32), {}
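
To make the `uniform(low=-0.05, high=0.05, size=(4,))` call concrete, here is a minimal sketch of what it does. It uses Python's standard `random` module as a stand-in for `self.np_random` (an assumption: Gym uses a NumPy random generator, so the actual numbers will differ), and `sample_start_state` is a hypothetical helper name, not part of Gym.

```python
import random


def sample_start_state(rng, low=-0.05, high=0.05, size=4):
    """Draw `size` independent uniform values between low and high.

    Hypothetical helper that mirrors the shape of the uniform(...) call
    in reset(); it is not Gym code.
    """
    return [rng.uniform(low, high) for _ in range(size)]


# Two generators created with the same seed produce the same start state,
# which is what passing `seed` to reset() is meant to allow.
rng_a = random.Random(0)
rng_b = random.Random(0)
state_a = sample_start_state(rng_a)
state_b = sample_start_state(rng_b)

# Drawing again from the same generator gives a different start state.
next_state = sample_start_state(rng_a)

print(state_a)
print(state_a == state_b)
print(next_state != state_a)
```

Each of the four values (cart position, cart velocity, pole angle, pole angular velocity) is small but non-zero, so every episode starts near, but not exactly at, the upright position.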

"""
Classic cart-pole system implemented by Rich Sutton et al.
Copied from http://incompleteideas.net/sutton/book/code/pole.c
permalink: https://perma.cc/C9ZM-652R
"""
import math
from typing import Optional, Union

import numpy as np

import gym
from gym import logger, spaces
from gym.error import DependencyNotInstalled


class CartPoleEnv(gym.Env[np.ndarray, Union[int, np.ndarray]]):
    """
    ### Description

    This environment corresponds to the version of the cart-pole problem described by Barto, Sutton, and Anderson in
    ["Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems"](https://ieeexplore.ieee.org/document/6313077).
    A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track.
    The pendulum is placed upright on the cart and the goal is to balance the pole by applying forces
     in the left and right direction on the cart.

    ### Action Space

    The action is an integer taking values in `{0, 1}` (the action space is `Discrete(2)`), indicating the direction
     of the fixed force the cart is pushed with.

    | Num | Action                 |
    |-----|------------------------|
    | 0   | Push cart to the left  |
    | 1   | Push cart to the right |

    **Note**: The velocity that is reduced or increased by the applied force is not fixed and it depends on the angle
     the pole is pointing. The center of gravity of the pole varies the amount of energy needed to move the cart underneath it

    ### Observation Space

    The observation is a `ndarray` with shape `(4,)` with the values corresponding to the following positions and velocities:

    | Num | Observation           | Min                 | Max               |
    |-----|-----------------------|---------------------|-------------------|
    | 0   | Cart Position         | -4.8                | 4.8               |
    | 1   | Cart Velocity         | -Inf                | Inf               |
    | 2   | Pole Angle            | ~ -0.418 rad (-24°) | ~ 0.418 rad (24°) |
    | 3   | Pole Angular Velocity | -Inf                | Inf               |

    **Note:** While the ranges above denote the possible values for observation space of each element,
        it is not reflective of the allowed values of the state space in an unterminated episode. Particularly:
    -  The cart x-position (index 0) can take values between `(-4.8, 4.8)`, but the episode terminates
       if the cart leaves the `(-2.4, 2.4)` range.
    -  The pole angle can be observed between  `(-.418, .418)` radians (or **±24°**), but the episode terminates
       if the pole angle is not in the range `(-.2095, .2095)` (or **±12°**)

    ### Rewards

    Since the goal is to keep the pole upright for as long as possible, a reward of `+1` for every step taken,
    including the termination step, is allotted. The threshold for rewards is 475 for v1.

    ### Starting State

    All observations are assigned a uniformly random value in `(-0.05, 0.05)`

    ### Episode Termination

    The episode terminates if any one of the following occurs:
    1. Pole Angle is greater than ±12°
    2. Cart Position is greater than ±2.4 (center of the cart reaches the edge of the display)
    3. Episode length is greater than 500 (200 for v0)

    ### Arguments

    ```
    gym.make('CartPole-v1')
    ```

    No additional arguments are currently supported.
    """

    metadata = {"render_modes": ["human", "rgb_array"], "render_fps": 50}

    def __init__(self):
        self.gravity = 9.8
        self.masscart = 1.0
        self.masspole = 0.1
        self.total_mass = self.masspole + self.masscart
        self.length = 0.5  # actually half the pole's length
        self.polemass_length = self.masspole * self.length
        self.force_mag = 10.0
        self.tau = 0.02  # seconds between state updates
        self.kinematics_integrator = "euler"

        # Angle at which to fail the episode
        self.theta_threshold_radians = 12 * 2 * math.pi / 360
        self.x_threshold = 2.4

        # Angle limit set to 2 * theta_threshold_radians so failing observation
        # is still within bounds.
        high = np.array(
            [
                self.x_threshold * 2,
                np.finfo(np.float32).max,
                self.theta_threshold_radians * 2,
                np.finfo(np.float32).max,
            ],
            dtype=np.float32,
        )

        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Box(-high, high, dtype=np.float32)

        self.screen = None
        self.clock = None
        self.isopen = True
        self.state = None

        self.steps_beyond_done = None

    def step(self, action):
        err_msg = f"{action!r} ({type(action)}) invalid"
        assert self.action_space.contains(action), err_msg
        assert self.state is not None, "Call reset before using step method."
        x, x_dot, theta, theta_dot = self.state
        force = self.force_mag if action == 1 else -self.force_mag
        costheta = math.cos(theta)
        sintheta = math.sin(theta)

        # For the interested reader:
        # https://coneural.org/florian/papers/05_cart_pole.pdf
        temp = (
            force + self.polemass_length * theta_dot**2 * sintheta
        ) / self.total_mass
        thetaacc = (self.gravity * sintheta - costheta * temp) / (
            self.length * (4.0 / 3.0 - self.masspole * costheta**2 / self.total_mass)
        )
        xacc = temp - self.polemass_length * thetaacc * costheta / self.total_mass

        if self.kinematics_integrator == "euler":
            x = x + self.tau * x_dot
            x_dot = x_dot + self.tau * xacc
            theta = theta + self.tau * theta_dot
            theta_dot = theta_dot + self.tau * thetaacc
        else:  # semi-implicit euler
            x_dot = x_dot + self.tau * xacc
            x = x + self.tau * x_dot
            theta_dot = theta_dot + self.tau * thetaacc
            theta = theta + self.tau * theta_dot

        self.state = (x, x_dot, theta, theta_dot)

        done = bool(
            x < -self.x_threshold
            or x > self.x_threshold
            or theta < -self.theta_threshold_radians
            or theta > self.theta_threshold_radians
        )

        if not done:
            reward = 1.0
        elif self.steps_beyond_done is None:
            # Pole just fell!
            self.steps_beyond_done = 0
            reward = 1.0
        else:
            if self.steps_beyond_done == 0:
                logger.warn(
                    "You are calling 'step()' even though this "
                    "environment has already returned done = True. You "
                    "should always call 'reset()' once you receive 'done = "
                    "True' -- any further steps are undefined behavior."
                )
            self.steps_beyond_done += 1
            reward = 0.0

        return np.array(self.state, dtype=np.float32), reward, done, {}

    def reset(
        self,
        *,
        seed: Optional[int] = None,
        return_info: bool = False,
        options: Optional[dict] = None,
    ):
        super().reset(seed=seed)
        self.state = self.np_random.uniform(low=-0.05, high=0.05, size=(4,))
        self.steps_beyond_done = None
        if not return_info:
            return np.array(self.state, dtype=np.float32)
        else:
            return np.array(self.state, dtype=np.float32), {}

    def render(self, mode="human"):
        try:
            import pygame
            from pygame import gfxdraw
        except ImportError:
            raise DependencyNotInstalled(
                "pygame is not installed, run `pip install gym[classic_control]`"
            )

        screen_width = 600
        screen_height = 400

        world_width = self.x_threshold * 2
        scale = screen_width / world_width
        polewidth = 10.0
        polelen = scale * (2 * self.length)
        cartwidth = 50.0
        cartheight = 30.0

        if self.state is None:
            return None

        x = self.state

        if self.screen is None:
            pygame.init()
            pygame.display.init()
            self.screen = pygame.display.set_mode((screen_width, screen_height))
        if self.clock is None:
            self.clock = pygame.time.Clock()

        self.surf = pygame.Surface((screen_width, screen_height))
        self.surf.fill((255, 255, 255))

        l, r, t, b = -cartwidth / 2, cartwidth / 2, cartheight / 2, -cartheight / 2
        axleoffset = cartheight / 4.0
        cartx = x[0] * scale + screen_width / 2.0  # MIDDLE OF CART
        carty = 100  # TOP OF CART
        cart_coords = [(l, b), (l, t), (r, t), (r, b)]
        cart_coords = [(c[0] + cartx, c[1] + carty) for c in cart_coords]
        gfxdraw.aapolygon(self.surf, cart_coords, (0, 0, 0))
        gfxdraw.filled_polygon(self.surf, cart_coords, (0, 0, 0))

        l, r, t, b = (
            -polewidth / 2,
            polewidth / 2,
            polelen - polewidth / 2,
            -polewidth / 2,
        )

        pole_coords = []
        for coord in [(l, b), (l, t), (r, t), (r, b)]:
            coord = pygame.math.Vector2(coord).rotate_rad(-x[2])
            coord = (coord[0] + cartx, coord[1] + carty + axleoffset)
            pole_coords.append(coord)
        gfxdraw.aapolygon(self.surf, pole_coords, (202, 152, 101))
        gfxdraw.filled_polygon(self.surf, pole_coords, (202, 152, 101))

        gfxdraw.aacircle(
            self.surf,
            int(cartx),
            int(carty + axleoffset),
            int(polewidth / 2),
            (129, 132, 203),
        )
        gfxdraw.filled_circle(
            self.surf,
            int(cartx),
            int(carty + axleoffset),
            int(polewidth / 2),
            (129, 132, 203),
        )

        gfxdraw.hline(self.surf, 0, screen_width, carty, (0, 0, 0))

        self.surf = pygame.transform.flip(self.surf, False, True)
        self.screen.blit(self.surf, (0, 0))
        if mode == "human":
            pygame.event.pump()
            self.clock.tick(self.metadata["render_fps"])
            pygame.display.flip()

        if mode == "rgb_array":
            return np.transpose(
                np.array(pygame.surfarray.pixels3d(self.screen)), axes=(1, 0, 2)
            )
        else:
            return self.isopen

    def close(self):
        if self.screen is not None:
            import pygame

            pygame.display.quit()
            pygame.quit()
            self.isopen = False
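
One plausible reason for the random start (an assumption; the source above only states that the start is random, not why) is that the dynamics in `step()` are deterministic. Starting from exact zeros, a deterministic policy would then replay the same episode every time. The sketch below copies the Euler update and constants from the source into a standalone function using only the standard library; `rollout` and `push_toward_lean` are hypothetical names introduced for illustration.

```python
import math
import random

# Constants copied from CartPoleEnv.__init__ above.
GRAVITY, MASSCART, MASSPOLE = 9.8, 1.0, 0.1
TOTAL_MASS = MASSCART + MASSPOLE
LENGTH = 0.5  # half the pole's length
POLEMASS_LENGTH = MASSPOLE * LENGTH
FORCE_MAG, TAU = 10.0, 0.02
X_THRESHOLD = 2.4
THETA_THRESHOLD = 12 * 2 * math.pi / 360


def step(state, action):
    """One Euler update, following the 'euler' branch of step() above."""
    x, x_dot, theta, theta_dot = state
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    costheta, sintheta = math.cos(theta), math.sin(theta)
    temp = (force + POLEMASS_LENGTH * theta_dot**2 * sintheta) / TOTAL_MASS
    thetaacc = (GRAVITY * sintheta - costheta * temp) / (
        LENGTH * (4.0 / 3.0 - MASSPOLE * costheta**2 / TOTAL_MASS)
    )
    xacc = temp - POLEMASS_LENGTH * thetaacc * costheta / TOTAL_MASS
    new_state = (
        x + TAU * x_dot,
        x_dot + TAU * xacc,
        theta + TAU * theta_dot,
        theta_dot + TAU * thetaacc,
    )
    done = abs(new_state[0]) > X_THRESHOLD or abs(new_state[2]) > THETA_THRESHOLD
    return new_state, done


def push_toward_lean(state):
    """Hypothetical fixed policy: push right if the pole leans right."""
    return 1 if state[2] > 0 else 0


def rollout(start, policy, max_steps=200):
    """Run one episode from `start` and return the visited states."""
    state, trajectory = start, [start]
    for _ in range(max_steps):
        state, done = step(state, policy(state))
        trajectory.append(state)
        if done:
            break
    return trajectory


# From an all-zero start, two episodes are exactly the same.
zero_start = (0.0, 0.0, 0.0, 0.0)
run_1 = rollout(zero_start, push_toward_lean)
run_2 = rollout(zero_start, push_toward_lean)
print(run_1 == run_2)

# With random starts in (-0.05, 0.05), the episodes begin in different states.
rng = random.Random(0)
random_runs = [
    rollout(tuple(rng.uniform(-0.05, 0.05) for _ in range(4)), push_toward_lean)
    for _ in range(3)
]
print(len({run[0] for run in random_runs}))
```

Under these assumptions, the small random offset is what gives the agent different situations to learn from across episodes.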

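Answer by 溪风沐雪, 2022-06-08 14:12

In general, reset() simply re-initializes the environment. It only needs a return value if you want to retrieve some parameters produced during initialization; if all you are doing is initializing, you can leave out the return value entirely, and these 4 random numbers of yours are even less necessary.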

This answer was chosen by the asker as the best answer.
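
For context on the return value: in the source quoted in the question, `reset()` returns the initial observation, and an interaction loop typically uses that observation to choose the first action. The following is a minimal sketch of such a loop. `ToyEnv` is a hypothetical one-dimensional stand-in written for this example (it is not a Gym environment), and it follows the same `reset()` / 4-tuple `step()` shape as the quoted code.

```python
import random


class ToyEnv:
    """Hypothetical 1-D stand-in that mimics the reset()/step() shape above."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.x = None
        self.steps = 0

    def reset(self):
        # Random start, handed back to the caller as the first observation.
        self.x = self.rng.uniform(-0.05, 0.05)
        self.steps = 0
        return self.x

    def step(self, action):
        assert self.x is not None, "Call reset before using step method."
        self.x += 0.1 if action == 1 else -0.1
        self.steps += 1
        done = abs(self.x) > 1.0 or self.steps >= 50
        return self.x, 1.0, done, {}


def policy(obs):
    """Hypothetical policy: push back toward zero."""
    return 0 if obs > 0 else 1


env = ToyEnv(seed=0)
obs = env.reset()  # without this return value there is nothing to act on
first_action = policy(obs)

total_reward, done = 0.0, False
while not done:
    obs, reward, done, _ = env.step(policy(obs))
    total_reward += reward

print(first_action, total_reward)
```

The policy needs an observation before it can pick any action, which is why the loop starts from the value that `reset()` returns.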


Question events

Closed by the system on June 17
Answer accepted on June 9
Question created on June 8
