Reinforcement learning - optimizing weights given scores

Posted on 2024-10-25 23:41:58

I am working on a project that has a simulated robot exploring an unknown, but patterned environment (such as an office building) by moving around to predefined "sensing locations". In other words, at each point the robot must choose a new location to move to from the available visible locations. Our ultimate goal is to have the robot learn how to exploit the patterns in the environment to optimize global exploration time.

The robot chooses which location to move to next by assigning each candidate location a utility score, computed as a linear combination of a number of known features of that location (such as the distance to the point, the average distance from the point to all others, the area around the point already explored, etc.). My goal is to optimize the weights of this utility function to give the fastest time to explore the whole environment.
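To make the selection rule concrete, here is a minimal sketch of scoring candidates with a linear utility and picking the best one. The function name `choose_next_location`, the three feature columns, and all numeric values are assumptions made for illustration; they are not taken from the actual robot code.

```python
import numpy as np

def choose_next_location(features, weights):
    """Return the index of the candidate with the highest linear utility.

    features: (num_candidates, num_features) array, one row per visible location
    weights:  (num_features,) array of utility weights
    """
    features = np.asarray(features, dtype=float)
    weights = np.asarray(weights, dtype=float)
    utilities = features @ weights  # one linear combination per candidate
    return int(np.argmax(utilities)), utilities

# Hypothetical feature rows: [distance, mean distance to others, explored area nearby]
candidates = [
    [2.0, 5.0, 0.8],
    [4.0, 3.0, 0.1],
    [1.0, 6.0, 0.5],
]
weights = [-1.0, -0.5, -2.0]  # illustrative values only
best, utilities = choose_next_location(candidates, weights)
print(best, utilities)
```

The weights are the only free parameters here, which is why the whole problem reduces to finding a good weight vector.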

Because the score depends on the entire exploration path, I do not want to alter the weights mid-exploration. To test a combination of weights, I want the simulated robot to run through the entire environment with those weights and record the resulting score. Therefore, I can create an n x (|w|+1) array of data, where |w| is the number of weights and n is the number of runs, such as the following:

w1     w2     w3      w4       score
0.23,  4.30,  -0.33,  -2.001,  17030
-1.3,  2.03,  -10.1,  -0.021,  21983
3.65,  -1.1,  5.021,  0.2301,  19508
etc...
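A table like the one above could be collected roughly as follows. This is only a sketch: `run_episode` is a toy placeholder for the real simulator, and the uniform sampling range of -10 to 10 is an assumption, not something stated in the question.

```python
import numpy as np

def run_episode(weights):
    """Toy placeholder for one full simulated exploration; returns a score.

    Assumption for illustration only: the real simulator would go here.
    """
    target = np.array([0.5, 4.0, -1.0, -2.0])
    return 17000.0 + 100.0 * float(np.sum((weights - target) ** 2))

def collect_runs(num_runs, num_weights, rng):
    """Sample weight vectors, run one episode each, and stack [w1..wk, score] rows."""
    rows = []
    for _ in range(num_runs):
        weights = rng.uniform(-10.0, 10.0, size=num_weights)
        score = run_episode(weights)  # weights stay fixed for the whole episode
        rows.append(np.append(weights, score))
    return np.vstack(rows)

rng = np.random.default_rng(0)
data = collect_runs(num_runs=5, num_weights=4, rng=rng)
print(data.shape)  # (5, 5): one row per run, |w| + 1 columns
```

Each row is one complete exploration, so the cost of the dataset is dominated by how long a single simulated run takes.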

My question is, what sort of reinforcement learning algorithm would be best for this? Most of what I find in the literature and in my own research has to do with classification, and obviously multivariate regression won't work. I also tried implementing a Q-learning algorithm, but this does not really work because the number of states and actions varies with the path taken and the structure of the environment. What I really want is some sort of structure that takes in row after row of the data and determines the values of the weights, and their combinations, that maximize the expected score. Any help/ideas? Thanks.
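For illustration, one structure of the kind described here might be an episode-level search that treats each full run as a single evaluation, such as the cross-entropy method. The sketch below is just one possibility under stated assumptions: `run_episode` is a toy stand-in for the simulator, the score is assumed to be "higher is better" (negate it if the score is exploration time), and the population size, elite fraction, and initial spread are arbitrary choices.

```python
import numpy as np

def run_episode(weights):
    """Toy stand-in for one full simulated exploration; returns a score to maximize.

    Hypothetical: the real simulator would replace this function.
    """
    target = np.array([0.5, 4.0, -1.0, -2.0])
    return -float(np.sum((weights - target) ** 2))

def cross_entropy_search(score_fn, num_weights, iterations=30, population=50,
                         elite_frac=0.2, seed=0):
    """Cross-entropy method: sample weights, keep the top scorers, refit the sampler."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(num_weights)
    std = np.full(num_weights, 5.0)
    num_elite = max(1, int(population * elite_frac))
    for _ in range(iterations):
        # Each sampled row is one weight vector evaluated on a full episode.
        samples = rng.normal(mean, std, size=(population, num_weights))
        scores = np.array([score_fn(w) for w in samples])
        elite = samples[np.argsort(scores)[-num_elite:]]  # highest-scoring rows
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-3  # small floor keeps sampling from collapsing
    return mean

best_weights = cross_entropy_search(run_episode, num_weights=4)
print(best_weights, run_episode(best_weights))
```

This only ever looks at (weights, score) rows, so it sidesteps the variable number of states and actions that makes Q-learning awkward in this setting.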
