Categorizing text based on keyword groups?



I have a list of requirements for a software project, assembled from the remains of its predecessor. Each requirement should map to one or more categories. Each of the categories consists of a group of keywords. What I'm trying to do is find an algorithm that would give me a score ranking which of the categories each requirement is likely to fall into. The results would be used as a starting point to further categorize the requirements.

As an example, suppose I have the requirement:

The system shall apply deposits to a customer's specified account.

And categories/keywords:

  1. Customer Transactions: deposits, deposit, customer, account, accounts
  2. Balance Accounts: account, accounts, debits, credits
  3. Other Category: foo, bar

I would want the algorithm to score the requirement highest in category 1, lower in category 2, and not at all in category 3. The scoring mechanism is mostly irrelevant to me, but needs to convey how much more likely category 1 applies than category 2.

I'm new to NLP, so I'm kind of at a loss. I've been reading Natural Language Processing in Python and was hoping to apply some of the concepts, but haven't seen anything that quite fits. I don't think a simple frequency distribution would work, since the text I'm processing is so small (a single sentence.)


Comments (3)

蓦然回首 2024-08-13 12:29:57


You might want to look at the category of "similarity measures" or "distance measures" (which is different, in data mining lingo, from "classification".)

Basically, a similarity measure is a mathematical way to:

  1. Take two sets of data (in your case, words)
  2. Do some computation/equation/algorithm
  3. The result being that you have some number which tells you how "similar" that data is.

With similarity measures, this number falls between 0 and 1, where "0" means "nothing matches at all" and "1" means "identical".

So you can actually think of your sentence as a vector - and each word in your sentence represents an element of that vector. Likewise for each category's list of keywords.

And then you can do something very simple: take the "cosine similarity" or "Jaccard index" (depending on how you structure your data.)

What both of these metrics do is they take both vectors (your input sentence, and your "keyword" list) and give you a number. If you do this across all of your categories, you can rank those numbers in order to see which match has the greatest similarity coefficient.
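
For instance, a minimal Python sketch of both metrics over word sets might look like this (the helper names and the binary-vector shortcut for cosine are my own choices; more on the vector intuition in the worked example below):

    import math

    def jaccard(sentence_words, keywords):
        # Jaccard index: |intersection| / |union| of the two word sets
        a, b = set(sentence_words), set(keywords)
        return len(a & b) / len(a | b) if a | b else 0.0

    def cosine(sentence_words, keywords):
        # cosine similarity of the two binary word vectors; for binary vectors
        # the dot product is just the size of the intersection
        a, b = set(sentence_words), set(keywords)
        norm = math.sqrt(len(a)) * math.sqrt(len(b))
        return len(a & b) / norm if norm else 0.0

    sentence = "the system shall apply deposits to a customer's specified account".split()
    categories = {
        "Customer Transactions": ["deposits", "deposit", "customer", "account", "accounts"],
        "Balance Accounts": ["account", "accounts", "debits", "credits"],
        "Other Category": ["foo", "bar"],
    }
    # rank the categories by similarity to the sentence
    for name, kws in sorted(categories.items(), key=lambda kv: -jaccard(sentence, kv[1])):
        print(name, round(jaccard(sentence, kws), 2), round(cosine(sentence, kws), 2))

Run against your example, "Customer Transactions" comes out on top, "Balance Accounts" second, and "Other Category" scores 0 - the ranking you asked for.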

As an example:

From your question:

Customer Transactions: deposits,
deposit, customer, account, accounts

So you could construct a vector with 5 elements: (1, 1, 1, 1, 1). This means that, for the "customer transactions" keyword list, you have 5 words, and (this will sound obvious but) each of those words is present in your search string. Stay with me.

So now you take your sentence:

The system shall apply deposits to a
customer's specified account.

This has 3 words from the "Customer Transactions" set: {deposits, account, customer}

(actually, this illustrates another nuance: you actually have "customer's". Is this equivalent to "customer"?)

The vector for your sentence might be (1, 0, 1, 1, 0)

The 1's in this vector are in the same position as the 1's in the first vector - because those words are the same.

So we could say: how many times do these vectors differ? Let's compare:

(1,1,1,1,1)
(1,0,1,1,0)

Hm. They have the same "bit" 3 times - in the 1st, 3rd, and 4th positions. They only differ by 2 bits. So let's say that when we compare these two vectors, we have a "distance" of 2. Congrats, we just computed the Hamming distance! The lower your Hamming distance, the more "similar" the data.
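
In code, that worked example is just a couple of lines:

    # the keyword vector and the sentence vector from the example above
    kw_vec   = (1, 1, 1, 1, 1)
    sent_vec = (1, 0, 1, 1, 0)

    # Hamming distance: the number of positions where the vectors disagree
    hamming = sum(a != b for a, b in zip(kw_vec, sent_vec))
    print(hamming)                    # 2
    # dividing by the vector length turns it into a 0-to-1 similarity
    print(1 - hamming / len(kw_vec))  # 0.6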

(The difference between a "similarity" measure and a "distance" measure is that the former is normalized - it gives you a value between 0 and 1. A distance is just any number, so it only gives you a relative value.)

Anyway, this might not be the best way to do natural language processing, but for your purposes it is the simplest and might actually work pretty well for your application, or at least as a starting point.

(PS: "classification" - as you have in your title - would be answering the question "If you take my sentence, which category is it most likely to fall into?" Which is a bit different than saying "how much more similar is my sentence to category 1 than category 2?" which seems to be what you're after.)

good luck!

遗弃M 2024-08-13 12:29:57


The main characteristics of the problem are:

  • Externally defined categorization criteria (keyword list)
  • Items to be classified (lines from the requirements document) are made of a relatively small number of attribute values, for effectively a single dimension: "keyword".
  • As defined, no feedback/calibration (although it may be appropriate to suggest some of that)

These characteristics bring both good and bad news: the implementation should be relatively straightforward, but a consistent level of accuracy in the categorization process may be hard to achieve. Also, the small size of the various quantities (number of possible categories, max/average number of words in an item, etc.) should give us room to select solutions that may be CPU- and/or space-intensive, if need be.

Yet, even with this license to "go fancy", I suggest starting with (and staying close to) a simple algorithm and expanding on this basis with a few additions and considerations, while remaining vigilant of the ever-present danger called overfitting.

Basic algorithm (conceptual, i.e. no focus on performance tricks at this time)

   Parameters = 
     CatKWs = an array/hash of lists of strings.  The list contains the possible
              keywords, for a given category.
         usage: CatKWs[CustTx] = ('deposits', 'deposit', 'customer' ...)
     NbCats = integer number of pre-defined categories
   Variables:
      CatAccu = an array/hash of numeric values with one entry per each of the
                possible categories.  usage:  CatAccu[3] = 4 (if array) or 
                 CatAccu['CustTx'] += 1  (hash)
      TotalKwOccurences = counts the total number of keyword matches (counted
       multiple times when a word is found in several pre-defined categories)

    Pseudo code:  (for categorizing one input item)
       1. for x in 1 to NbCats
            CatAccu[x] = 0    // reset the accumulators
       2. for each word W in Item
             for each x in 1 to NbCats
                 if W found in CatKWs[x]
                      TotalKwOccurences++
                      CatAccu[x]++
       3. for each x in 1 to NbCats
             CatAccu[x] = CatAccu[x] / TotalKwOccurences  // calculate rating
       4. Sort CatAccu by value
       5. Return the ordered list of (CategoryID, rating)
              for all corresponding CatAccu[x] values above a given threshold.
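
If it helps, here is a rough Python rendering of that pseudocode, keeping the same names (the dict-of-sets layout and the threshold default are my choices):

    def categorize(item_words, CatKWs, threshold=0.0):
        # step 1: reset the accumulators
        CatAccu = {cat: 0 for cat in CatKWs}
        TotalKwOccurences = 0
        # step 2: count keyword matches per category
        for W in item_words:
            for cat, keywords in CatKWs.items():
                if W in keywords:
                    TotalKwOccurences += 1
                    CatAccu[cat] += 1
        # step 3: convert the counts into ratings
        if TotalKwOccurences:
            for cat in CatAccu:
                CatAccu[cat] /= TotalKwOccurences
        # steps 4 and 5: sort, and keep only ratings above the threshold
        ranked = sorted(CatAccu.items(), key=lambda kv: kv[1], reverse=True)
        return [(cat, rating) for cat, rating in ranked if rating > threshold]

    CatKWs = {
        'CustTx': {'deposits', 'deposit', 'customer', 'account', 'accounts'},
        'BalAcct': {'account', 'accounts', 'debits', 'credits'},
    }
    item = "the system shall apply deposits to a customer's specified account".split()
    print(categorize(item, CatKWs))   # roughly [('CustTx', 0.67), ('BalAcct', 0.33)]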

Simple but plausible: we favor the categories that have the most matches, but we divide by the overall number of matches, as a way of lessening the confidence rating when many words were found. Note that this division does not affect the relative ranking of the category selection for a given item, but it may be significant when comparing ratings of different items.

Now, several simple improvements come to mind: (I'd seriously consider the first two, and give thought to the other ones; deciding on each of these is very much tied to the scope of the project, the statistical profile of the data to be categorized, and other factors...)

  • We should normalize the keywords read from the input items and/or match them in a fashion that is tolerant of misspellings. Since we have so few words to work with, we need to ensure we do not lose a significant one because of a silly typo.
  • We should give more importance to words found less frequently in CatKWs. For example, the word 'Account' should count for less than the word 'foo' or 'credit' (see the weighting sketch after this list).
  • We could (but maybe that won't be useful or even helpful) give more weight to the ratings of items that have fewer [non-noise] words.
  • We could also include consideration based on bigrams (two consecutive words), since with natural languages (and requirements documents are not quite natural :-) ) word proximity is often a stronger indicator than the words themselves.
  • We could add a tiny bit of importance to the category assigned to the preceding (or even following, in a look-ahead logic) item. Items will likely come in related series and we can benefit from this regularity.
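
For the second improvement, one simple scheme (my choice; an IDF-style weight would work too) is to weight each keyword by the inverse of the number of categories that list it, then accumulate CatAccu[x] += weight instead of incrementing by 1:

    from collections import Counter

    def keyword_weights(CatKWs):
        # a keyword listed in many categories (e.g. 'account') counts for less
        df = Counter(kw for keywords in CatKWs.values() for kw in set(keywords))
        return {kw: 1.0 / df[kw] for kw in df}

    CatKWs = {
        'CustTx': {'deposits', 'deposit', 'customer', 'account', 'accounts'},
        'BalAcct': {'account', 'accounts', 'debits', 'credits'},
    }
    weights = keyword_weights(CatKWs)
    print(weights['account'])    # 0.5 - listed in two categories
    print(weights['deposits'])   # 1.0 - listed in only one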

Also, aside from the calculation of the rating per se, we should also consider:

  • some metrics that would be used to rate the algorithm outcome itself (tbd)
  • some logic to collect the list of words associated with an assigned category and to eventually run statistics on these. This may allow the identification of words representative of a category and not initially listed in CatKWs (a sketch follows below).
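
For that last point, a per-category word counter is probably enough to surface candidate keywords; a small sketch (items_with_categories is a hypothetical list of already-categorized items):

    from collections import Counter, defaultdict

    # hypothetical output of the basic algorithm: (item words, assigned category)
    items_with_categories = [
        ("the system shall apply deposits to a customer's specified account".split(), 'CustTx'),
        ("the system shall record each deposit against the customer ledger".split(), 'CustTx'),
    ]

    word_stats = defaultdict(Counter)
    for words, category in items_with_categories:
        word_stats[category].update(words)

    # frequent words not yet listed in CatKWs for 'CustTx' are candidate keywords
    # (in practice you would strip noise/stop words first)
    known = {'deposits', 'deposit', 'customer', 'account', 'accounts'}
    print([(w, n) for w, n in word_stats['CustTx'].most_common()
           if w not in known and n > 1])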

The question of metrics should be considered early, but this would also require a reference set of input items: a "training set" of sorts, even though we are working off a pre-defined dictionary of category keywords (typically training sets are used to determine this very list of category keywords, along with a weight factor). Of course, such a reference/training set should be both statistically significant and statistically representative [of the whole set].

To summarize: stick to simple approaches; anyway, the context doesn't leave room to be very fancy. Consider introducing a way of measuring the efficiency of particular algorithms (or of particular parameters within a given algorithm), but beware that such metrics may be flawed and may prompt you to specialize the solution for a given set to the detriment of the other items (overfitting).

疧_╮線 2024-08-13 12:29:57


I was also facing the same issue of creating a classifier based only on keywords. I had a class-keywords mapper file, which contained the class variable and the list of keywords occurring in that particular class. I came up with the following algorithm, and it is working really well.

# predictor algorithm
predictedDoc = []
for doc in readContent:
    # one accumulator per category, reset for each document
    catAccum = [0] * len(docKywrdmppr)
    for i in range(len(docKywrdmppr)):
        # keywords for category i, with stop words removed
        keywords = removeStopWords(docKywrdmppr['Keywords'][i].casefold())
        for word in removeStopWords(doc):
            if word.casefold() in keywords:
                catAccum[i] += 1  # count one hit per matching keyword
    # predict the category with the most keyword hits
    ind = catAccum.index(max(catAccum))
    predictedDoc.append(docKywrdmppr['Document Type'][ind])