Document classification using a genetic algorithm
I have a bit of a problem with my university project.
I have to implement document classification using a genetic algorithm.
I've had a look at this example and (let's say) understood the principles of genetic algorithms, but I'm not sure how they can be applied to document classification. I can't figure out the fitness function.
Here is what I've managed to think of so far (it's probably totally wrong...):
Assume that I have the categories and that each category is described by some keywords.
Split the file into words.
Create the first population from arrays (100 arrays, for example, but it will depend on the size of the file) filled with random words from the file.
1:
Choose the best category for each child in the population (by counting the keywords in it).
Cross over each pair of children in the population (the new array contains half of each child) - "crossover"
Fill the rest of the children left over from the crossover with random unused words from the file - "evolution??"
Replace random words in a random child from the new population with a random word from the file (used or not) - "mutation"
Copy the best results to the new population.
Go to 1 until some population limit is reached or some category has been found enough times.
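The steps above can be sketched roughly as follows in Python. This is only a minimal sketch under stated assumptions, not a definitive implementation: the fitness of an individual is assumed to be the keyword-hit count of its best-matching category, the final label is assumed to be a hit-weighted vote over the last population, and the "fill with unused words" step is left out. All function names and parameters (`best_category`, `classify`, `pop_size`, `length`, `rate`, ...) are made up for illustration.

```python
import random
from collections import Counter


def best_category(individual, categories):
    """Return (category name, keyword hits) for the best-matching category."""
    scores = {name: sum(word in keywords for word in individual)
              for name, keywords in categories.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


def crossover(a, b):
    """New array made of the first half of a and the second half of b."""
    half = len(a) // 2
    return a[:half] + b[half:]


def mutate(individual, words, rate, rng):
    """Replace each word with a random word from the file with probability rate."""
    return [rng.choice(words) if rng.random() < rate else w for w in individual]


def classify(text, categories, pop_size=100, length=10,
             generations=30, rate=0.05, seed=0):
    rng = random.Random(seed)
    words = text.lower().split()
    if not words:
        return None
    # First population: arrays filled with random words from the file.
    population = [[rng.choice(words) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Assumed fitness: keyword hits of the individual's best category.
        population.sort(key=lambda ind: best_category(ind, categories)[1],
                        reverse=True)
        elite = population[:pop_size // 10]            # copy the best results
        parents = population[:max(2, pop_size // 2)]   # fitter half breeds
        children = [mutate(crossover(*rng.sample(parents, 2)), words, rate, rng)
                    for _ in range(pop_size - len(elite))]
        population = elite + children
    # Hit-weighted vote over the final population (an assumption).
    votes = Counter()
    for ind in population:
        name, hits = best_category(ind, categories)
        if hits:
            votes[name] += hits
    return votes.most_common(1)[0][0] if votes else None
```

For example, with hypothetical categories `{"sports": {"football", "goal", "team"}, "cooking": {"recipe", "oven", "flour"}}`, calling `classify(text, categories)` returns the name of the category whose keywords dominate the evolved population, or `None` if no keyword from any category appears in the file.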
I'm not sure if this is correct and would be happy to get some advice, guys.
Much appreciated!
Comments (2)
Ivane, in order to properly apply GAs to document classification:
So the steps that you've described are on the right track, but I'll give you some improvements:
So what you want to do is:
Whenever you get a new document (one which has not been classified before), you can now classify it with your best GA:
So this is not the end-all-be-all solution, but it should give you a decent start.
Regards,
Kiril
You might find Learning Classifier Systems useful/interesting. An LCS is a type of evolutionary algorithm intended for classification problems. There is a chapter about them in Eiben & Smith's Introduction to Evolutionary Computing.