How best to store data for a chatbot?
I was browsing the internet for chatbots, just for fun. Now I like the subject so much that I want to develop my own chatbot.
The first step is to find a good way to manage the "brain" of my chatbot. I think the best solution is to save everything in an XML file, isn't it?
So the file type is clear. That brings me to the relationships between different nouns. Take a noun such as a tree: how do I best store that a tree has leaves, branches and roots, and that a tree needs water and sunlight to survive?
Should I store it like this, or some other way?
This would be my XML for the tree example:
<nouns>
  <noun id="noun_0">
    <name>tree</name>
    <relationship>
      <has>noun_1</has>
      <has>noun_2</has>
      <has>noun_3</has>
      <need>noun_4</need>
      <need>noun_5</need>
    </relationship>
  </noun>
  <noun id="noun_1">
    <name>root</name>
  </noun>
  <noun id="noun_2">
    <name>branch</name>
    <relationship>
      <has>noun_3</has>
    </relationship>
  </noun>
  <noun id="noun_3">
    <name>leaf</name>
  </noun>
  <noun id="noun_4">
    <name>water</name>
  </noun>
  <noun id="noun_5">
    <name>light</name>
  </noun>
  . . .
</nouns>
Comments (4)
Data Storage Choices: It Depends
Simple, non-learning bots: XML is fine
It looks like you already have a basic XML structure worked out. For just starting out, I'd say that's fine, especially for AI support-chat kinds of bots, along the lines of:
if userMsg.contains('lega') then print('TOS & Copyright...')
Of course, switching to any new format will take time and overhead.
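For what it's worth, here is a rough Python sketch of how a small bot could load an XML "brain" like the one in the question and resolve the noun ids to names. The XML is a shortened copy of the question's example; the function name load_nouns and the shape of the returned dictionary are my own assumptions, not anything the question prescribes.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question.
XML = """
<nouns>
  <noun id="noun_0">
    <name>tree</name>
    <relationship>
      <has>noun_1</has>
      <need>noun_4</need>
    </relationship>
  </noun>
  <noun id="noun_1"><name>root</name></noun>
  <noun id="noun_4"><name>water</name></noun>
</nouns>
"""


def load_nouns(xml_text):
    """Return {noun name: {relation tag: [related noun names]}} (hypothetical layout)."""
    root = ET.fromstring(xml_text)
    # First pass: map ids to names so that references can be resolved.
    names = {n.get("id"): n.findtext("name") for n in root.findall("noun")}
    brain = {}
    for noun in root.findall("noun"):
        relations = {}
        rel = noun.find("relationship")
        if rel is not None:
            for child in rel:  # <has> and <need> elements
                relations.setdefault(child.tag, []).append(names[child.text])
        brain[noun.findtext("name")] = relations
    return brain


brain = load_nouns(XML)
print(brain["tree"])
```

Everything ends up in one in-memory dictionary, which is fine for a small rule set but is also exactly what stops scaling once the file grows.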
Learning, Complicated bots: database!
If you're looking to do something much larger, especially if you have CleverBot in mind, I think you're going to need a database. This is because your file is just a file: once it becomes gigantic, keeping all of it available in memory is resource intensive. For this kind of project, I'd recommend a database.
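As a minimal sketch of what the database route could look like, here is the tree example stored in SQLite through Python's standard sqlite3 module. The table names (noun, relation), the column names and the helper functions are assumptions made up for illustration; any relational database would do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path instead for persistent storage
conn.executescript("""
    CREATE TABLE noun (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    CREATE TABLE relation (
        subject_id INTEGER NOT NULL REFERENCES noun(id),
        kind       TEXT    NOT NULL,  -- 'has' or 'need', as in the XML
        object_id  INTEGER NOT NULL REFERENCES noun(id)
    );
""")


def add_relation(subject, kind, obj):
    """Store 'subject kind obj', creating the nouns if they are new."""
    for name in (subject, obj):
        conn.execute("INSERT OR IGNORE INTO noun (name) VALUES (?)", (name,))
    conn.execute(
        "INSERT INTO relation (subject_id, kind, object_id) "
        "SELECT s.id, ?, o.id FROM noun s, noun o "
        "WHERE s.name = ? AND o.name = ?",
        (kind, subject, obj),
    )


def related(subject, kind):
    """Return the names related to subject by kind, sorted alphabetically."""
    rows = conn.execute(
        "SELECT o.name FROM relation r "
        "JOIN noun s ON s.id = r.subject_id "
        "JOIN noun o ON o.id = r.object_id "
        "WHERE s.name = ? AND r.kind = ? ORDER BY o.name",
        (subject, kind),
    )
    return [name for (name,) in rows]


for obj in ("root", "branch", "leaf"):
    add_relation("tree", "has", obj)
for obj in ("water", "light"):
    add_relation("tree", "need", obj)

print(related("tree", "has"))
```

The point of this layout is that a lookup only touches the rows it needs, so the whole "brain" never has to sit in memory at once.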
Why? English is Complicated
A while back I wrote a naive Bayes spam sorter. It took about 10,000 pieces of spam to "train" it at a 7% accuracy rate, which took about 6 hours and 1.5GB of RAM to hold the data in memory. That's a lot of data. English is very hard and can't really be broken down into rules like
if 'pony' then 'saddle'
so for a bot to "learn" the best responses, your database is going to become massive very quickly.
I think this information can be modeled as an ontology. An ontology lets you encode much richer information in terms of relations, attributes, levels, etc. Formats such as RDF and OWL are available and are supported by almost all languages.
Most importantly, managing the data is easy if you use an ontology editor. I would recommend Protege (http://protege.stanford.edu/); take a look at it.
You could also try something like the graph database that Freebase uses to store relations between various entities. Basically, it is a graph of nodes and edges, where each node has attributes and values for those attributes. Edges have attributes in the same way, and an edge connecting two nodes defines a relationship between them.
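To make the node-and-edge idea concrete, here is a tiny in-memory sketch in Python. It is not how Freebase or any real graph database is implemented; the Graph class, its method names and the attribute names (kind, relation) are all hypothetical and only illustrate nodes and edges that both carry attributes.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # node name -> attribute dict
    edges: list = field(default_factory=list)  # (source, target, attribute dict)

    def add_node(self, name, **attrs):
        """Create the node if needed and merge in the given attributes."""
        self.nodes.setdefault(name, {}).update(attrs)

    def add_edge(self, source, target, **attrs):
        """Connect two nodes; the edge attributes describe the relationship."""
        self.add_node(source)
        self.add_node(target)
        self.edges.append((source, target, attrs))

    def neighbours(self, source, **match):
        """Targets of edges from source whose attributes include match."""
        return [
            target
            for src, target, attrs in self.edges
            if src == source and all(attrs.get(k) == v for k, v in match.items())
        ]


g = Graph()
g.add_node("tree", kind="plant")  # 'kind' is an illustrative attribute
g.add_edge("tree", "leaf", relation="has")
g.add_edge("tree", "water", relation="need")
print(g.neighbours("tree", relation="has"))
```

A real graph database adds persistence, indexing and a query language on top of this, but the data model is the same.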
You are probably looking at a database. Any serious NLP system would use one, unless you have a rule-based system that operates on a small set of rules. Think about whether you would want to write a piece of C code that handles a 5 MB XML file. I most definitely would not. Stanford University hosts a nice demo if you are interested in the linguistic side of it.