Reinforcement learning - optimizing weights given a score
I am working on a project that has a simulated robot exploring an unknown, but patterned environment (such as an office building) by moving around to predefined "sensing locations". In other words, at each point the robot must choose a new location to move to from the available visible locations. Our ultimate goal is to have the robot learn how to exploit the patterns in the environment to optimize global exploration time.
The robot chooses which location to move to next by assigning each candidate location a utility score, computed as a linear combination of a number of known features of that location (such as the distance to the point, the average distance from the point to all others, the area around the point that has already been explored, etc.). My goal is to optimize the weights of this utility function so that the robot explores the whole environment in the shortest time.
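To make the scoring rule concrete, here is a minimal sketch of such a linear utility in Python. The feature order, the example weights, and the location names `A`, `B`, `C` are all made up for illustration; they are not taken from my actual simulator.

```python
from typing import Dict, Sequence


def utility(weights: Sequence[float], features: Sequence[float]) -> float:
    """Linear utility: the weighted sum of a location's feature values."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * f for w, f in zip(weights, features))


def choose_next(weights: Sequence[float],
                candidates: Dict[str, Sequence[float]]) -> str:
    """Return the name of the visible location with the highest utility."""
    return max(candidates, key=lambda name: utility(weights, candidates[name]))


# Hypothetical feature order: distance to the point, average distance to all
# other points, fraction of the surrounding area already explored.
weights = [-1.0, -0.5, -2.0]
candidates = {
    "A": [2.0, 5.0, 0.8],
    "B": [4.0, 3.0, 0.1],
    "C": [1.0, 6.0, 0.9],
}
best = choose_next(weights, candidates)
print(best)
```

The weights stay fixed for the whole run; only the candidate set and the feature values change from step to step.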
Because the score depends on the entire exploration path, I do not want to alter the weights mid-exploration. To test a combination of weights, I want the simulated robot to run through the entire environment with those weights and record the resulting score. Therefore, I can create an n x (|w|+1) array of data, where |w| is the number of weights and n is the number of runs, such as the following:
w1,   w2,   w3,    w4,     score
0.23, 4.30, -0.33, -2.001, 17030
-1.3, 2.03, -10.1, -0.021, 21983
3.65, -1.1, 5.021, 0.2301, 19508
etc...
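A rough sketch of how rows like these could be collected is below. The function `toy_exploration_time` is only a placeholder for one full simulated run (it is not my simulator), and the sampling range of -5 to 5 is an arbitrary assumption. The toy score is treated as an exploration time, so lower is better.

```python
import random
from typing import Callable, List, Sequence, Tuple


def toy_exploration_time(weights: Sequence[float]) -> float:
    """Placeholder for one full simulated run; NOT a real simulator.

    Returns a made-up exploration time that is smallest at (1, -2, 0.5, 0).
    """
    target = (1.0, -2.0, 0.5, 0.0)
    return 15000.0 + 100.0 * sum((w - t) ** 2 for w, t in zip(weights, target))


def collect_rows(simulate: Callable[[Sequence[float]], float],
                 n_rows: int,
                 n_weights: int,
                 rng: random.Random) -> List[Tuple[List[float], float]]:
    """Sample weight vectors uniformly and record one score per full run."""
    rows = []
    for _ in range(n_rows):
        w = [rng.uniform(-5.0, 5.0) for _ in range(n_weights)]
        rows.append((w, simulate(w)))
    return rows


rng = random.Random(0)  # fixed seed so the table is reproducible
rows = collect_rows(toy_exploration_time, n_rows=5, n_weights=4, rng=rng)
for w, score in rows:
    print(", ".join(f"{x:.3f}" for x in w) + f", {score:.0f}")
```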
My question is, what sort of reinforcement learning algorithm would be best for this? Most of what I find in the literature and my research has to do with classification, and obviously multivariate regression won't work. I also tried implementing a Q-learning algorithm, but this does not really work, as the number of states and actions varies depending on the path taken and the structure of the environment. What I really want is some sort of structure that takes in row after row of the data and determines the weight values, and combinations of them, that maximize the expected score. Any help/ideas? Thanks.
Comments (1)
The way you formalize your setup (no intermediate rewards, no online learning, just a final score) is typical for black-box optimization (or phylogenetic reinforcement learning).
Among the appropriate algorithms are genetic algorithms, evolution strategies, or stochastic search. Some state-of-the-art algorithms are:
that each come in different flavors, depending on how many parameters you have, how noisy your score is and how many local optima you expect.
For a collection of implementations of these in Python, look at the PyBrain library.
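As a rough illustration of the black-box approach described above, here is a minimal sketch of a basic (1+λ) evolution strategy with a simple success-based step-size rule, using only the Python standard library. It is not one of the state-of-the-art variants and does not use PyBrain. The objective `toy_exploration_time` is a made-up stand-in for one full simulated run, and the step-size factors 1.5 and 0.8 are arbitrary choices. The sketch minimizes, on the assumption that the score is an exploration time; negate the objective if a higher score is better.

```python
import random
from typing import Callable, List, Sequence, Tuple


def toy_exploration_time(weights: Sequence[float]) -> float:
    """Made-up stand-in for a full run; smallest at (1, -2, 0.5, 0)."""
    target = (1.0, -2.0, 0.5, 0.0)
    return 15000.0 + 100.0 * sum((w - t) ** 2 for w, t in zip(weights, target))


def one_plus_lambda_es(objective: Callable[[Sequence[float]], float],
                       x0: Sequence[float],
                       sigma: float = 1.0,
                       offspring: int = 8,
                       generations: int = 200,
                       seed: int = 0) -> Tuple[List[float], float]:
    """Minimize `objective` with an elitist (1+lambda) evolution strategy."""
    rng = random.Random(seed)
    parent = list(x0)
    parent_cost = objective(parent)
    for _ in range(generations):
        # Mutate the parent with isotropic Gaussian noise; one run per child.
        children = [[x + rng.gauss(0.0, sigma) for x in parent]
                    for _ in range(offspring)]
        costs = [objective(child) for child in children]
        best_idx = min(range(offspring), key=costs.__getitem__)
        if costs[best_idx] < parent_cost:
            # Keep the improvement and take larger steps next time.
            parent, parent_cost = children[best_idx], costs[best_idx]
            sigma *= 1.5
        else:
            # No child improved on the parent: search closer to it.
            sigma *= 0.8
    return parent, parent_cost


x0 = [0.0, 0.0, 0.0, 0.0]
best_weights, best_cost = one_plus_lambda_es(toy_exploration_time, x0)
print([round(w, 3) for w in best_weights], round(best_cost, 2))
```

Every call to the objective corresponds to one complete exploration run with fixed weights, so the budget here is 1 + 8 × 200 runs. With a noisy score, averaging several runs per weight vector is one common way to stabilize the comparison.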