Document classification using genetic algorithms

Posted 2024-10-12 14:28:19

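I have a bit of a problem with my university project.

I have to implement document classification using a genetic algorithm.

I've had a look at this example and (let's say) understood the principles of genetic algorithms, but I'm not sure how they can be applied to document classification. I can't figure out the fitness function.

Here is what I've come up with so far (it may be totally wrong...)

Assume that I have the categories and that each category is described by some keywords.
Split the file into words.
Create the first population from arrays (e.g. 100 arrays, but this depends on the size of the file) filled with random words from the file.
1:
Choose the best category for each child in the population (by counting the keywords in it).
Cross over every 2 children in the population (the new array contains half of each child) - "crossover"
Fill the rest of the children left over from the crossover with random unused words from the file - "evolution?"
Replace random words in a random child of the new population with random words from the file (used or unused) - "mutation"
Copy the best results to the new population.
Go to 1 until some population limit is reached or some category has been found enough times.

I'm not sure if this is correct and would be happy to get some advice, guys.
Many thanks!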


久隐师 2024-10-19 14:28:19

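Ivane, in order to properly apply GAs to document classification:

You have to reduce the problem to a system of components that can be evolved.
You can't do GA training for document classification on a single document.

So the steps that you've described are on the right track, but I'll give you some improvements:

Have a sufficient amount of training data: you need a set of documents that are already classified and diverse enough to cover the range of documents you're likely to encounter.
Train your GA to correctly classify a subset of those documents, i.e. the training data set.
At each generation, test your best specimen against a validation data set and stop training if the validation accuracy starts to decrease.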
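To make the training/validation idea concrete, here is a minimal Python sketch of one possible fitness function: the fraction of labeled documents a specimen classifies correctly. The genome layout (a per-category dictionary of word weights), the function names `split_dataset`, `classify` and `fitness`, and the toy documents are all assumptions made for illustration; nothing in the thread prescribes them.

```python
import random
from collections import Counter

def split_dataset(labeled_docs, validation_fraction=0.3, seed=0):
    """Shuffle labeled documents and split them into (training, validation)."""
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)
    n_val = max(1, int(len(docs) * validation_fraction))
    return docs[n_val:], docs[:n_val]

def classify(genome, text):
    """Return the category with the highest weighted word count.

    Assumed genome layout: {category: {word: weight}}.
    """
    counts = Counter(text.lower().split())
    scores = {
        category: sum(weight * counts[word] for word, weight in weights.items())
        for category, weights in genome.items()
    }
    return max(scores, key=scores.get)

def fitness(genome, labeled_docs):
    """Fraction of the labeled documents that the genome classifies correctly."""
    correct = sum(1 for text, label in labeled_docs if classify(genome, text) == label)
    return correct / len(labeled_docs)

# Toy data, purely illustrative
docs = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("parliament passed the new budget", "politics"),
    ("the minister won the election vote", "politics"),
]
genome = {
    "sports": {"football": 1.0, "goal": 1.0, "match": 0.5},
    "politics": {"parliament": 1.0, "election": 1.0, "vote": 0.5},
}
train, val = split_dataset(docs, validation_fraction=0.25)
print(len(train), len(val), fitness(genome, docs))
```

With this representation the GA would evolve the weights, the training fitness would be `fitness(genome, train)`, and the early-stopping check would use `fitness(genome, val)`.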

So what you want to do is:

// Randomly generate the initial population of GAs once, before the loop
population[] = randomlyGenerateGAs();

prevValidationFitness = default;
currentValidationFitness = default;
bestGA = default;

do
{
    prevValidationFitness = currentValidationFitness;

    // Train your population on the training data set
    bestGA = Train(population);

    // Get the validation fitness of the best GA
    currentValidationFitness = Validate(bestGA);

    // Make your selection (e.g. half of the population, roulette wheel selection, or random selection)
    selection[] = makeSelection(population);

    // Mate the specimens in the selection (each mating involves a crossover and possibly a mutation)
    population = mate(selection);
}
while(currentValidationFitness.IsBetterThan( prevValidationFitness ));
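For reference, the same loop can be sketched in runnable Python. This is only one way to do it, under stated assumptions: `evolve`, `one_point_crossover` and `gaussian_mutation` are hypothetical helper names, the selection step keeps the fitter half of the population (one of the options mentioned in the comments above), the population is assumed to hold at least two specimens, and the two fitness functions are numeric stand-ins. In the document setting they would be classification accuracy on the training and validation sets.

```python
import random

def evolve(population, train_fitness, validation_fitness, crossover, mutate,
           max_generations=100, rng=None):
    """Evolve until the best specimen's validation fitness stops improving."""
    rng = rng or random.Random(0)
    best, best_val = None, float("-inf")
    for _ in range(max_generations):
        # Rank the population on the training set, fittest first
        ranked = sorted(population, key=train_fitness, reverse=True)
        candidate = ranked[0]
        candidate_val = validation_fitness(candidate)
        if candidate_val <= best_val:
            break  # validation fitness no longer improves: stop training
        best, best_val = candidate, candidate_val
        # Selection: keep the fitter half (assumed choice, others are possible)
        parents = ranked[: max(2, len(ranked) // 2)]
        # Mating: each child is a crossover of two parents, then a mutation
        population = [
            mutate(crossover(*rng.sample(parents, 2), rng), rng)
            for _ in range(len(ranked))
        ]
    return best, best_val

def one_point_crossover(a, b, rng):
    """Child takes the head of one parent and the tail of the other."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def gaussian_mutation(genome, rng, rate=0.2):
    """Perturb each gene with probability `rate`."""
    return [g + rng.gauss(0, 0.1) if rng.random() < rate else g for g in genome]

# Stand-in fitness functions (hypothetical); real ones would score documents
def train_fitness(genome):
    return -sum((g - 1.0) ** 2 for g in genome)

def validation_fitness(genome):
    return -sum((g - 0.9) ** 2 for g in genome)

init_rng = random.Random(42)
initial_population = [[init_rng.uniform(-1, 1) for _ in range(5)] for _ in range(20)]
best, best_val = evolve(initial_population, train_fitness, validation_fitness,
                        one_point_crossover, gaussian_mutation,
                        rng=random.Random(1))
print(best_val)
```

Unlike the pseudocode, this version returns the specimen with the best validation fitness seen so far, so the generation that triggered the stop is not the one that gets used.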

Whenever you get a new document (one which has not been classified before), you can now classify it with your best GA:

category = bestGA.Classify(document);

So this is not the final solution, but it should give you a decent start.
Pozdravi,
Kiril
