Document classification using a genetic algorithm

Posted on 2024-10-12 14:28:19

I have a bit of a problem with my university project.

I have to implement document classification using a genetic algorithm.

I've had a look at this example and (let's say) understood the principles of genetic algorithms, but I'm not sure how they can be applied to document classification. In particular, I can't figure out the fitness function.

Here is what I've managed to come up with so far (it's probably totally wrong...)

Assume I have the categories, and each category is described by some keywords.
Split the file into words.
Create the first population from arrays (100 arrays, for example, though that will depend on the size of the file) filled with random words from the file.
1:
Choose the best category for each child in the population (by counting the keywords in it).
Cross over every 2 children in the population (the new array contains half of each child) - "crossover"
Fill the rest of the children left over from the crossover with random unused words from the file - "evolution??"
Replace random words in a random child from the new population with a random word from the file (used or not) - "mutation"
Copy the best results to the new population.
Go to 1 until some population limit is reached or some category is found enough times
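
To make the "counting the keywords" step concrete, here is a minimal Python sketch. The category names, the keyword sets, and the `best_category` helper are all hypothetical and only illustrate the idea. They are not taken from any particular library.

```python
from collections import Counter

def best_category(words, category_keywords):
    """Return the category whose keywords occur most often in `words`,
    together with the per-category counts."""
    counts = Counter(w.lower() for w in words)
    scores = {
        category: sum(counts[k] for k in keywords)
        for category, keywords in category_keywords.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical categories and keywords, for illustration only.
CATEGORY_KEYWORDS = {
    "sports": {"match", "goal", "team"},
    "finance": {"bank", "stock", "market"},
}

# One "child": an array of words drawn from the file.
child = "the team scored a goal before the market closed".split()
category, scores = best_category(child, CATEGORY_KEYWORDS)
print(category, scores)
```

Here the child contains two sports keywords and one finance keyword, so it is assigned to "sports". Note that this scores a bag of words, not a classifier, so on its own it does not yet give the GA anything to learn.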

I'm not sure whether this is correct and would be glad of any advice, guys.
Much appreciated!

Comments (2)

久隐师 2024-10-19 14:28:19

Ivane, in order to properly apply GAs to document classification:

  1. You have to reduce the problem to a system of components that can be evolved.
  2. You can't do GA training for document classification on a single document.

So the steps that you've described are on the right track, but I'll give you some improvements:

  • Have a sufficient amount of training data: you need a set of documents which are already classified and are diverse enough to cover the range of documents which you're likely to encounter.
  • Train your GA to correctly classify a subset of those documents, aka the Training Data Set.
  • At each generation, test your best specimen against a Validation Data Set and stop training if the validation accuracy starts to decrease.
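
One possible way to turn these points into a fitness function is sketched below in Python. It is only an illustration under assumptions of my own: each specimen is a flat list of weights, one per (category, vocabulary word) pair, and fitness is the fraction of labelled training documents it classifies correctly. The names `CATEGORIES`, `VOCAB`, `classify` and `fitness`, and the tiny data set, are hypothetical.

```python
import random

CATEGORIES = ["sports", "finance"]          # hypothetical labels
VOCAB = ["goal", "team", "bank", "stock"]   # hypothetical vocabulary

def random_chromosome(rng):
    """One weight per (category, vocabulary word) pair, flattened into a list."""
    return [rng.uniform(-1.0, 1.0) for _ in range(len(CATEGORIES) * len(VOCAB))]

def classify(chromosome, words):
    """Score each category as the sum of its weights for the known words present."""
    best, best_score = None, float("-inf")
    for c, category in enumerate(CATEGORIES):
        offset = c * len(VOCAB)
        score = sum(chromosome[offset + VOCAB.index(w)] for w in words if w in VOCAB)
        if score > best_score:
            best, best_score = category, score
    return best

def fitness(chromosome, labelled_docs):
    """Fraction of (words, label) pairs that the chromosome classifies correctly."""
    correct = sum(classify(chromosome, words) == label for words, label in labelled_docs)
    return correct / len(labelled_docs)

# Tiny made-up training set of already-classified documents.
TRAINING = [
    (["team", "goal"], "sports"),
    (["bank", "stock"], "finance"),
]

# A hand-written chromosome that weights each word towards its own category.
perfect = [1, 1, 0, 0,   0, 0, 1, 1]
print(fitness(perfect, TRAINING))  # 1.0
```

The same `fitness` function can be called with the validation set instead of the training set to get the validation accuracy mentioned in the last bullet.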

So what you want to do is:

// Generate the initial population once, before the loop
population = randomlyGenerateGAs();
bestGA = default;
prevValidationFitness = default;    // assumed to compare as worse than any real fitness
currentValidationFitness = default;

do
{
    prevValidationFitness = currentValidationFitness;

    // Evaluate the population on the training data set and pick the fittest specimen
    candidate = Train(population);

    // Get the validation fitness of that specimen
    currentValidationFitness = Validate(candidate);

    // Keep the candidate only if the validation fitness improved
    if (currentValidationFitness.IsBetterThan(prevValidationFitness))
        bestGA = candidate;

    // Make your selection (e.g. half of the population, roulette wheel selection, or random selection)
    selection = makeSelection(population);

    // Mate the specimens in the selection (each mating involves a crossover and possibly a mutation)
    population = mate(selection);
}
while (currentValidationFitness.IsBetterThan(prevValidationFitness));
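
For reference, here is a runnable Python sketch of a loop in the same spirit. It makes several assumptions that are not part of the outline above: specimens are flat weight lists over a hypothetical vocabulary, selection keeps the top half of the population, and training stops after `patience` generations without a validation improvement, which is a more tolerant variant of stopping at the first decrease. All names and data are illustrative.

```python
import random

CATEGORIES = ["sports", "finance"]          # hypothetical labels
VOCAB = ["goal", "team", "bank", "stock"]   # hypothetical vocabulary
GENES = len(CATEGORIES) * len(VOCAB)

def classify(chrom, words):
    """Pick the category with the highest summed weight for the known words."""
    def score(c):
        return sum(chrom[c * len(VOCAB) + VOCAB.index(w)] for w in words if w in VOCAB)
    return CATEGORIES[max(range(len(CATEGORIES)), key=score)]

def accuracy(chrom, docs):
    """Fraction of (words, label) pairs classified correctly."""
    return sum(classify(chrom, w) == label for w, label in docs) / len(docs)

def mate(a, b, rng, mutation_rate=0.1):
    """Single-point crossover followed by per-gene mutation."""
    cut = rng.randrange(1, GENES)
    child = a[:cut] + b[cut:]
    return [rng.uniform(-1, 1) if rng.random() < mutation_rate else g for g in child]

def evolve(train, validation, pop_size=30, max_generations=50, patience=5, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-1, 1) for _ in range(GENES)] for _ in range(pop_size)]
    best, best_val, stale = None, -1.0, 0
    for _ in range(max_generations):
        # Rank the population by accuracy on the training set.
        population.sort(key=lambda c: accuracy(c, train), reverse=True)
        # Check the fittest specimen against the validation set.
        val = accuracy(population[0], validation)
        if val > best_val:
            best, best_val, stale = list(population[0]), val, 0
        else:
            stale += 1
            if stale >= patience:
                break  # validation accuracy has stopped improving
        # Truncation selection: keep the top half, refill by mating.
        parents = population[: pop_size // 2]
        children = [mate(rng.choice(parents), rng.choice(parents), rng)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return best, best_val

# Tiny made-up data sets of already-classified documents.
TRAIN = [(["team", "goal"], "sports"), (["bank", "stock"], "finance"),
         (["goal"], "sports"), (["stock"], "finance")]
VALID = [(["team"], "sports"), (["bank"], "finance")]

best, best_val = evolve(TRAIN, VALID)
print(best_val, classify(best, ["team", "goal"]))
```

With real data the vocabulary and the document sets would of course be far larger, and roulette wheel selection could replace the truncation step without changing the rest of the loop.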

Whenever you get a new document (one which has not been classified before), you can now classify it with your best GA:

category = bestGA.Classify(document);

So this is not the end-all-be-all solution, but it should give you a decent start.
Pozdravi,
Kiril

我只土不豪 2024-10-19 14:28:19

You might find Learning Classifier Systems useful/interesting. An LCS is a type of evolutionary algorithm intended for classification problems. There is a chapter about them in Eiben & Smith's Introduction to Evolutionary Computing.
