Building a NetHack bot: is Bayesian Analysis a good strategy?
A friend of mine is beginning to build a NetHack bot (a bot that plays the Roguelike game NetHack). There is a very good working bot for the similar game Angband, but it works partly because it is easy to go back to town and you can always scum low levels to gain items.
In NetHack, the problem is much more difficult, because the game rewards ballsy experimentation and is built basically as 1,000 edge cases.
Recently I suggested using some kind of naive Bayesian analysis, in much the same way a spam filter is built.
Basically the bot would first build a corpus by trying every possible action with every item or creature it finds, and storing that information along with, for instance, how close it came to death, injury, or some other negative effect. Over time it seems like you could generate a reasonably playable model.
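To make the idea concrete, here is a rough sketch of the kind of corpus and naive-Bayes-style scoring I have in mind (the class name, feature names, and outcome labels are all made up; nothing here comes from an existing bot):

```python
from collections import defaultdict

class NaiveOutcomeModel:
    """Toy corpus: count, for each (action, feature) pair, how often it preceded
    a 'bad' outcome (death, injury, negative effect) versus a 'safe' one."""

    def __init__(self):
        self.counts = {"bad": defaultdict(int), "safe": defaultdict(int)}
        self.totals = {"bad": 0, "safe": 0}

    def record(self, action, features, outcome):
        # outcome is "bad" or "safe"; features describe the item or creature
        self.totals[outcome] += 1
        for f in features:
            self.counts[outcome][(action, f)] += 1

    def bad_score(self, action, features):
        # Naive-Bayes style: multiply per-feature likelihood ratios, pretending
        # the features are independent given the outcome (Laplace-smoothed).
        score = 1.0
        for f in features:
            p_bad = (self.counts["bad"][(action, f)] + 1) / (self.totals["bad"] + 2)
            p_safe = (self.counts["safe"][(action, f)] + 1) / (self.totals["safe"] + 2)
            score *= p_bad / p_safe
        return score  # > 1 means the corpus says this is probably a bad idea

model = NaiveOutcomeModel()
model.record("eat", ["corpse", "cockatrice"], "bad")
model.record("eat", ["corpse", "newt"], "safe")
print(model.bad_score("eat", ["corpse", "cockatrice"]))  # 2.0: probably bad
```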
Can anyone point us in the right direction of what a good start would be? Am I barking up the wrong tree or misunderstanding the idea of bayesian analysis?
Edit: My friend put up a github repo of his NetHack patch that allows python bindings. It's still in a pretty primitive state but if anyone's interested...
5 Answers
Although Bayesian analysis encompasses much more, the Naive Bayes algorithm well known from spam filters is based on one very fundamental assumption: all variables are essentially independent of each other. So for instance, in spam filtering each word is usually treated as a variable, which means assuming that if the email contains the word 'viagra', that knowledge does not affect the probability that it will also contain the word 'medicine' (or 'foo' or 'spam' or anything else). The interesting thing is that this assumption is quite obviously false when it comes to natural language, but still manages to produce reasonable results.
Now, one way people sometimes get around the independence assumption is to define variables that are technically combinations of things (like searching for the token 'buy viagra'). That can work if you know specific cases to look for, but in general, in a game environment, it means that you can't really remember anything. So each time you have to move, perform an action, etc., it's completely independent of anything else you've done so far. I would say even for the simplest games, this is a very inefficient way to go about learning the game.
I would suggest looking into using Q-learning instead. Most of the examples you'll find are usually just simple games anyway (like learning to navigate a map while avoiding walls, traps, monsters, etc.). Reinforcement learning is a type of online unsupervised learning that does really well in situations that can be modeled as an agent interacting with an environment, like a game (or robots). It does this by trying to figure out what the optimal action is at each state in the environment (where each state can include as many variables as needed, much more than just 'where am I'). The trick then is to maintain just enough state to help the bot make good decisions, without having a distinct point in your state 'space' for every possible combination of previous actions.
To put that in more concrete terms, if you were to build a chess bot you would probably have trouble if you tried to create a decision policy that made decisions based on all previous moves, since the set of all possible combinations of chess moves grows really quickly. Even a simpler model of where every piece is on the board is still a very large state space, so you have to find a way to simplify what you keep track of. But notice that you do get to keep track of some state, so that your bot doesn't just keep trying to make a left turn into a wall over and over again.
The Wikipedia article is pretty jargon-heavy, but this tutorial does a much better job of translating the concepts into real-world examples.
The one catch is that you do need to be able to define rewards to provide as the positive 'reinforcement'. That is, you need to be able to define the states that the bot is trying to get to, otherwise it will just continue forever.
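To make that concrete, here is a minimal tabular Q-learning sketch in Python. The `env` object, its `actions` list, and the `reset`/`step` interface are assumptions for illustration (the usual RL environment shape), not a real NetHack binding, and the constants are arbitrary:

```python
import random
from collections import defaultdict

ALPHA = 0.1    # learning rate
GAMMA = 0.95   # discount factor
EPSILON = 0.1  # exploration rate

def q_learn(env, episodes=1000):
    """Tabular Q-learning: learn an estimated value for each (state, action) pair.

    `env` is assumed to provide `actions` (a list), `reset() -> state`, and
    `step(action) -> (next_state, reward, done)`; the reward is the positive
    'reinforcement' mentioned above (e.g. finding food, going down stairs).
    """
    q = defaultdict(float)  # (state, action) -> estimated long-term value

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy: mostly exploit what we know, sometimes explore
            if random.random() < EPSILON:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: q[(state, a)])

            next_state, reward, done = env.step(action)

            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            future = 0.0 if done else max(q[(next_state, a)] for a in env.actions)
            q[(state, action)] += ALPHA * (reward + GAMMA * future - q[(state, action)])
            state = next_state
    return q
```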
这是有先例的:可怕的 rog-o-matic 程序成功地进行了流氓行为,甚至几次带着 Yendor 的护身符回来了。不幸的是,rogue 仅发布了二进制文件,而不是源代码,因此它已经死亡(除非您可以在 MicroVAX 上设置 4.3BSD 系统),从而使 rog-o-matic 无法玩任何克隆。它只是挂起,因为它们的模拟不够接近。
然而,我认为 rog-o-matic 是我一直以来最喜欢的程序,不仅因为它所取得的成就,还因为代码的可读性及其算法的可理解智能。它使用“基因继承”:新玩家将从前一对成功玩家那里继承偏好组合,并进行一些随机偏移,然后与机器进行对抗。更成功的偏好会在基因库中向上迁移,而不太成功的偏好会向下迁移。
如今,源代码很难找到,但搜索“rogomatic”将为您指明道路。
There is precedent: the monstrous rog-o-matic program succeeded in playing rogue and even returned with the Amulet of Yendor a few times. Unfortunately, rogue was only released as a binary, not source, so it has died (unless you can set up a 4.3BSD system on a MicroVAX), leaving rog-o-matic unable to play any of the clones. It just hangs because their emulation isn't close enough.
However, rog-o-matic is, I think, my favourite program of all time, not only because of what it achieved but because of the readability of the code and the comprehensible intelligence of its algorithms. It used "genetic inheritance": a new player would inherit a combination of preferences from a previous pair of successful players, with some random offset, then be pitted against the machine. More successful preferences would migrate up in the gene pool and less successful ones down.
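The scheme looks roughly like this in Python (a sketch of the idea only, not rog-o-matic's actual code; the preference names and fitness function are invented):

```python
import random

# Illustrative preference "genes"; rog-o-matic's real preferences differ.
PREFS = ["aggression", "caution", "greed", "exploration"]

def make_child(parent_a, parent_b, jitter=0.1):
    """Blend two successful parents' preferences and add a small random offset."""
    child = {}
    for p in PREFS:
        mix = random.random()
        child[p] = mix * parent_a[p] + (1 - mix) * parent_b[p] + random.uniform(-jitter, jitter)
    return child

def evolve(population, fitness, generations=50, keep=10):
    """More successful preference sets migrate up the gene pool; less successful ones drop out."""
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[:keep]
        children = [make_child(random.choice(survivors), random.choice(survivors))
                    for _ in range(len(population) - keep)]
        population = survivors + children
    return max(population, key=fitness)
```

Here `fitness` would be something like how deep a run got, or how long the player survived, before dying.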
The source can be hard to find these days, but searching "rogomatic" will set you on the path.
I doubt bayesian analysis will get you far because most of NetHack is highly contextual. There are very few actions which are always a bad idea; most are also life-savers in the "right" situation (an extreme example is eating a cockatrice: that's bad, unless you are starving and currently polymorphed into a stone-resistant monster, in which case eating the cockatrice is the right thing to do). Some of those "almost bad" actions are required to win the game (e.g. coming up the stairs on level 1, or deliberately falling in traps to reach Gehennom).
What you could try is doing it at the "meta" level. Design the bot to choose randomly among a variety of "elementary behaviors". Then try to measure how these bots fare. Then extract the combinations of behaviors which seem to promote survival; Bayesian analysis could do that across a wide corpus of games along with their "success level". For instance, if there are behaviors "pick up daggers" and "avoid engaging monsters in melee", I would assume that analysis would show that those two behaviors fit well together: bots which pick daggers up without using them, and bots which try to throw missiles at monsters without gathering such missiles, will probably fare worse.
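A rough sketch of that meta-level analysis (the behavior names and the success score are invented for illustration):

```python
import random
from collections import defaultdict

# Hypothetical elementary behaviors a bot can be composed of.
BEHAVIORS = ["pick_up_daggers", "throw_missiles", "avoid_melee", "pray_when_weak"]

def random_bot():
    """A bot is just a random subset of elementary behaviors."""
    return frozenset(b for b in BEHAVIORS if random.random() < 0.5)

def pair_success(games):
    """games: list of (behavior_set, success_score) pairs from played-out runs.

    Returns the average success for each pair of behaviors that appeared
    together, which hints at which combinations promote survival."""
    scores = defaultdict(list)
    for behaviors, score in games:
        for a in behaviors:
            for b in behaviors:
                if a < b:
                    scores[(a, b)].append(score)
    return {pair: sum(s) / len(s) for pair, s in scores.items()}
```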
This somewhat mimics what learning gamers often ask for in rec.games.roguelike.nethack. Most questions are similar to: "should I drink unknown potions to identify them?" or "what level should my character be before going that deep in the dungeon?". Answers to those questions depend heavily on what else the player is doing, and there is no good absolute answer.
A difficult point here is how to measure the success at survival. If you simply try to maximize the time spent before dying, then you will favor bots which never leave the first levels; those may live long but will never win the game. If you measure success by how deep the character goes before dying then the best bots will be archeologists (who start with a pick-axe) in a digging frenzy.
Apparently there are a good number of NetHack bots out there. Check out this listing:
In NetHack, unknown actions usually have a boolean effect -- either you gain or you lose. Bayesian networks are based around "fuzzy logic" values -- an action may give a gain with a given probability. Hence, you don't need a Bayesian network, just a list of "discovered effects" and whether they are good or bad.
No need to eat the Cockatrice again, is there?
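In code that amounts to nothing more than a lookup table, something like this (the keys are just examples):

```python
# Remember each discovered (action, target) effect and never repeat the bad ones.
discovered = {}  # (action, target) -> "good" or "bad"

def should_try(action, target):
    return discovered.get((action, target)) != "bad"

discovered[("eat", "cockatrice corpse")] = "bad"
print(should_try("eat", "cockatrice corpse"))  # False -- learned that lesson once
```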
All in all, it depends on how much "knowledge" you want to give the bot to start with. Do you want him to learn everything "the hard way", or will you feed him spoilers till he's stuffed?