...CartPole / FlappyBird games with gym reinforcement learning - aohun0743's blog - CSDN Blog


Source: CSDN technical community. Published: 2021-03-25

# For the agent to perform well over the long run, it must consider not only
# immediate rewards but also future rewards. To that end we define a discount
# rate (gamma). The agent then learns, from the states it has seen, to maximize
# the discounted future reward.
for i in batches:
    # Extract the i-th transition from memory
    state, action, reward, next_state, done = self.memory[i]
    # If the episode is done, the target is just the reward (the -100 penalty)
    target = reward
    if not done:
        # Predict the future discounted reward
        target = reward + self.gamma * np.amax(self.model.predict(next_state)[0])
    # Make the agent approximately map the current state
    # to the future discounted reward. We'll call that target_f
    target_f = self.model.predict(state)
    target_f[0][action] = target
    # Train the neural net with the state and target_f
    # (epochs was called nb_epoch in Keras 1)
    self.model.fit(state, target_f, epochs=1, verbose=0)
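The target computed in the loop above can be checked by hand. The following is a minimal sketch rather than the author's code: the helper name `q_target` and the sample numbers are assumptions for illustration, and the network prediction is replaced by a fixed array.

```python
import numpy as np

def q_target(reward, next_q_values, done, gamma=0.9):
    """Return the Q-learning target for a single transition.

    next_q_values stands in for model.predict(next_state)[0]; here it is
    just a fixed array (an assumption), so no network is involved.
    """
    if done:
        # Terminal transition: there is no future reward to discount.
        return float(reward)
    # Immediate reward plus the discounted best value of the next state.
    return float(reward + gamma * np.amax(next_q_values))

# Non-terminal step: reward 1 plus 0.9 times the best next-state value (2.0).
print(q_target(1.0, np.array([0.5, 2.0]), done=False))
# Terminal step carrying the -100 penalty used in this article.
print(q_target(-100.0, np.array([0.5, 2.0]), done=True))
```

With gamma = 0.9 the first call evaluates 1 + 0.9 × 2.0, so a smaller gamma would make the agent weigh the next state's value less.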

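How the agent chooses actions: for an initial period the agent picks actions at random, with a probability given by the exploration rate, or epsilon. At the start, the best strategy for the agent is to try everything before it has learned any pattern. When the agent does not act randomly, it predicts the reward values for the current state and picks the action with the highest predicted reward.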

# np.argmax() returns the index of the largest value in act_values[0]
def act(self, state):
    if np.random.rand() <= self.epsilon:
        # The agent acts randomly
        return self.env.action_space.sample()
    # Predict the reward value based on the given state
    act_values = self.model.predict(state)
    # Pick the action based on the predicted reward
    return np.argmax(act_values[0])
# act_values[0] looks like [0.67, 0.2]; the numbers are the predicted rewards of
# actions 0 and 1, so argmax() picks the action with the larger value. For
# [0.67, 0.2], argmax() returns 0 because the value at index 0 is the largest.
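The same selection rule can be tried without a network or an environment. This is a sketch under assumptions: `epsilon_greedy` is a hypothetical helper, the Q-values are the `[0.67, 0.2]` example from the text, and the seed is arbitrary.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() <= epsilon:
        # Explore: any action index, uniformly at random.
        return int(rng.integers(len(q_values)))
    # Exploit: index of the largest predicted reward.
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)  # arbitrary fixed seed
q = np.array([0.67, 0.2])

# epsilon = 0 means no exploration, so the greedy action is chosen.
greedy_action = epsilon_greedy(q, epsilon=0.0, rng=rng)

# epsilon = 1 means every action is random, so both indices show up.
random_actions = [epsilon_greedy(q, epsilon=1.0, rng=rng) for _ in range(1000)]
print(greedy_action, sorted(set(random_actions)))
```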

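Hyperparameters - some of the hyperparameters a reinforcement learning agent needs

· episodes: the number of games the agent plays
· gamma: the discount rate, used to compute the discounted future reward
· epsilon: the exploration rate, the probability with which the agent picks a random action
· epsilon_decay: the decay rate of epsilon, so that the agent explores less as it gets better at the game
· epsilon_min: the lowest exploration rate we want the agent to keep
· learning_rate: how much the neural network learns in each iteration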
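To get a feel for how epsilon, epsilon_decay and epsilon_min interact, the sketch below counts how many decay steps it takes to reach the floor. It assumes the values used later in this article (1, 0.995 and 0.1) and one multiplicative decay per step; `decay_steps_until_min` is a hypothetical helper name.

```python
import math

def decay_steps_until_min(epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.1):
    """Count multiplicative decay steps until epsilon is no longer above epsilon_min."""
    steps = 0
    while epsilon > epsilon_min:
        epsilon *= epsilon_decay
        steps += 1
    return steps, epsilon

steps, final_epsilon = decay_steps_until_min()
# Closed form for comparison: smallest n with decay**n <= epsilon_min.
closed_form = math.ceil(math.log(0.1) / math.log(0.995))
print(steps, closed_form, final_epsilon)
```

A decay closer to 1 stretches the exploration phase, while a smaller value makes the agent commit to its learned policy sooner.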

Designing the deep Q-learning agent - DQNAgent

# Deep Q-learning agent
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop

class DQNAgent:
    def __init__(self, env):
        self.env = env
        self.memory = []
        self.gamma = 0.9  # discount rate
        self.epsilon = 1  # exploration rate
        self.epsilon_decay = .995
        self.epsilon_min = 0.1
        self.learning_rate = 0.0001
        self._build_model()

    def _build_model(self):
        # 4 state inputs -> three hidden layers -> one Q-value per action
        model = Sequential()
        model.add(Dense(128, input_dim=4, activation='tanh'))
        model.add(Dense(128, activation='tanh'))
        model.add(Dense(128, activation='tanh'))
        model.add(Dense(2, activation='linear'))
        # older Keras versions name this argument lr
        model.compile(loss='mse',
                      optimizer=RMSprop(learning_rate=self.learning_rate))
        self.model = model

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        if np.random.rand() <= self.epsilon:
            return self.env.action_space.sample()
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])  # returns action

    def replay(self, batch_size):
        batches = min(batch_size, len(self.memory))
        batches = np.random.choice(len(self.memory), batches)
        for i in batches:
            state, action, reward, next_state, done = self.memory[i]
            target = reward
            if not done:
                target = reward + self.gamma * \
                    np.amax(self.model.predict(next_state)[0])
            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        # Explore a little less after every replay
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

Training the DQNAgent

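import copy
import gym
import numpy as np

if __name__ == "__main__":
    # Assumed value: the original listing uses episodes without defining it
    episodes = 1000
    # Initialize the gym environment and the agent
    # (classic gym API: reset() returns the observation, step() returns 4 values)
    env = gym.make('CartPole-v0')
    agent = DQNAgent(env)
    # Main loop of the game
    for e in range(episodes):
        # Reset the state at the start of every game
        state = env.reset()
        state = np.reshape(state, [1, 4])
        # time_t is the frame index of the game.
        # The goal is to keep the pole upright for as long as possible:
        # the larger time_t gets, the higher the score
        for time_t in range(5000):
            # turn this on if you want to render
            # env.render()
            # Choose an action
            action = agent.act(state)
            # Apply the action to advance the game
            next_state, reward, done, _ = env.step(action)
            next_state = np.reshape(next_state, [1, 4])
            # reward defaults to 1: the agent is rewarded for every frame
            # in which the pole stays up, and gets -100 when it fails
            reward = -100 if done else reward
            # Remember the previous state, action, reward and next state
            agent.remember(state, action, reward, next_state, done)
            # The next state becomes the current state of the next frame
            state = copy.deepcopy(next_state)
            # done is set to True when the game ends,
            # for example when the agent drops the pole
            if done:
                # Print the score and leave the game loop
                print("episode: {}/{}, score: {}".format(e, episodes, time_t))
                break
        # Train the model on past experience
        agent.replay(32)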

Results

[Exploration]

The agent explores the game environment by acting randomly.

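[Training]

The algorithm trains the agent through several phases:

1. The agent driving the cart tries to balance the pole.
2. But the cart goes out of bounds and the game ends.
3. When it gets too close to the boundary it has to move the cart away, and the pole falls.
4. The agent finally masters the balance and learns to control the pole.

After a few hundred episodes of training it starts to learn how to maximize its score.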

Playing CartPole with Nature DQN using Keras and the Gym environment

Blog https://www.jianshu.com/p/e037d42ab6b1

Github https://github.com/xiaochus/Deep-Reinforcement-Learning-Practice

Nature DQN

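DQN uses a single network both to select actions and to compute the target Q-value. Nature DQN uses two networks: a main network that selects actions and whose parameters are trained, and a target network that computes the target Q-value. The two networks have exactly the same structure. The target network's parameters are not updated iteratively. Instead, they are copied from the main network at fixed intervals (a delayed update), which reduces the correlation between the target Q-value and the current Q-value. Apart from using this second, identically structured target network to compute the target Q-value, Nature DQN is essentially the same as DQN.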
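The delayed update can be illustrated without Keras. In this sketch a linear map stands in for each neural network, which is an assumption made only to keep the example dependency-light; the class and method names are hypothetical and do not come from the linked repository.

```python
import numpy as np

class LinearQPair:
    """A main and a target Q-function with identical shape."""

    def __init__(self, n_features=4, n_actions=2, seed=0):
        rng = np.random.default_rng(seed)
        self.main_w = rng.normal(size=(n_features, n_actions))
        # The target starts as an independent copy of the main weights.
        self.target_w = self.main_w.copy()

    def q_main(self, state):
        return state @ self.main_w

    def q_target(self, state):
        return state @ self.target_w

    def sync_target(self):
        # Delayed update: copy the main weights into the target.
        self.target_w = self.main_w.copy()

nets = LinearQPair()
state = np.ones((1, 4))
before = nets.q_target(state).copy()

nets.main_w += 0.5  # pretend a training step changed the main network
stale = np.allclose(nets.q_target(state), before)  # target did not move

nets.sync_target()
fresh = np.allclose(nets.q_target(state), nets.q_main(state))  # now they agree
print(stale, fresh)
```

Between syncs the target values stay fixed while the main network trains, which is the decoupling the paragraph above describes.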

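Implementation steps
1. Build the neural networks: one main network and one target network. Both take the observation as input and output the Q-value of each action.
2. When an episode ends (the game is won or the agent dies), reset env so that observation returns to its initial state. From the observation, select an action with the ε-greedy strategy. Taking the selected action yields the new next_observation, the reward and the game status. Store [observation, action, reward, next_observation, done] in the experience pool. The pool has a fixed capacity, so old data is deleted.
3. Randomly sample a batch of data from the experience pool and compute the Q-values of observation as Q_target. For data where done is False, compute discount_reward from reward and next_observation, then write discount_reward into Q_target.
4. Perform one gradient-descent update per action, using MSE as the loss function. Note that, unlike DPG, the parameter update does not happen at the end of each game but at every step while the game is running.
5. After each batch, update the parameter epsilon. The epsilon of ε-greedy keeps shrinking, so the randomness keeps decreasing.
6. Every fixed number of steps, copy the parameters from the main network to the target network.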
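Steps 2 and 3 above can be sketched as a bounded experience pool plus a batched target computation. This is an illustrative sketch, not the code from the linked repository: `ReplayBuffer`, `batch_targets`, the capacity, and the two callables standing in for the main and target networks are all assumptions.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-capacity experience pool; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def add(self, observation, action, reward, next_observation, done):
        self.data.append((observation, action, reward, next_observation, done))

    def sample(self, batch_size):
        return random.sample(list(self.data), min(batch_size, len(self.data)))

    def __len__(self):
        return len(self.data)

def batch_targets(batch, q_main, q_target, gamma=0.9):
    """Build (observations, Q_target) for one training batch.

    q_main and q_target map an (n, features) array to (n, actions) Q-values.
    """
    obs = np.array([t[0] for t in batch], dtype=float)
    actions = np.array([t[1] for t in batch])
    rewards = np.array([t[2] for t in batch], dtype=float)
    next_obs = np.array([t[3] for t in batch], dtype=float)
    dones = np.array([t[4] for t in batch], dtype=bool)

    targets = np.array(q_main(obs), dtype=float)   # start from current Q-values
    next_best = q_target(next_obs).max(axis=1)     # best next value per sample
    # For finished episodes only the reward counts; otherwise add the
    # discounted value from the target network.
    targets[np.arange(len(batch)), actions] = rewards + gamma * next_best * (~dones)
    return obs, targets

buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add([float(step)] * 4, step % 2, 1.0, [float(step + 1)] * 4, step == 4)

zeros = lambda x: np.zeros((len(x), 2))      # stand-in main network
twos = lambda x: np.full((len(x), 2), 2.0)   # stand-in target network
obs, targets = batch_targets(list(buffer.data), zeros, twos)
print(len(buffer), targets)
```

Using `deque(maxlen=...)` gives the "old data is deleted" behaviour for free, at the cost of always evicting in insertion order.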

Implementing Nature DQN with Keras

# -*- coding: utf-8 -*-
import os
import gym
import random
import numpy as np
from collections import deque
from keras.layers import Input, Dense
from keras.models import Model
from keras.optimizers import Adam
import keras.backend as K

Link to this article: http://hunanature.immuno-online.com/view-775249.html
