Categorize text based on groups of keywords?
I have a list of requirements for a software project, assembled from the remains of its predecessor. Each requirement should map to one or more categories. Each of the categories consists of a group of keywords. What I'm trying to do is find an algorithm that would give me a score ranking which of the categories each requirement is likely to fall into. The results would be used as a starting point to further categorize the requirements.
As an example, suppose I have the requirement:
The system shall apply deposits to a customer's specified account.
And categories/keywords:
- Customer Transactions: deposits, deposit, customer, account, accounts
- Balance Accounts: account, accounts, debits, credits
- Other Category: foo, bar
I would want the algorithm to score the requirement highest in category 1, lower in category 2, and not at all in category 3. The scoring mechanism is mostly irrelevant to me, but needs to convey how much more likely category 1 applies than category 2.
I'm new to NLP, so I'm kind of at a loss. I've been reading Natural Language Processing with Python and was hoping to apply some of the concepts, but haven't seen anything that quite fits. I don't think a simple frequency distribution would work, since the text I'm processing is so small (a single sentence).
3 Answers
You might want to look at the category of "similarity measures" or "distance measures" (which, in data-mining lingo, is different from "classification").
Basically, a similarity measure is a mathematical way to take two data sets and produce a number expressing how alike they are. With similarity measures, this number is between 0 and 1, where "0" means "nothing matches at all" and "1" means "identical".
So you can actually think of your sentence as a vector - and each word in your sentence represents an element of that vector. Likewise for each category's list of keywords.
And then you can do something very simple: take the "cosine similarity" or the "Jaccard index" (depending on how you structure your data).
What both of these metrics do is take two vectors (your input sentence and a category's "keyword" list) and give you a number. If you do this across all of your categories, you can rank those numbers to see which match has the greatest similarity coefficient.
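To make that concrete, here is a minimal sketch (the naive tokenization and the helper functions are my own simplifications, not a reference implementation) that scores the example sentence against each category with both metrics:

```python
import math

def tokens(text):
    # Naive tokenization: lowercase, strip punctuation, drop possessive "'s".
    words = set()
    for w in text.lower().split():
        w = w.strip('.,;:!?"()').removesuffix("'s")
        if w:
            words.add(w)
    return words

def jaccard(a, b):
    # |intersection| / |union| of the two word sets.
    return len(a & b) / len(a | b)

def cosine(a, b):
    # For binary (0/1) word vectors, cosine similarity reduces to
    # |a & b| / sqrt(|a| * |b|).
    return len(a & b) / math.sqrt(len(a) * len(b))

sentence = tokens("The system shall apply deposits to a customer's specified account.")
categories = {
    "Customer Transactions": {"deposits", "deposit", "customer", "account", "accounts"},
    "Balance Accounts": {"account", "accounts", "debits", "credits"},
    "Other Category": {"foo", "bar"},
}

for name, keywords in sorted(categories.items(),
                             key=lambda kv: -cosine(sentence, kv[1])):
    print(f"{name}: jaccard={jaccard(sentence, keywords):.3f}, "
          f"cosine={cosine(sentence, keywords):.3f}")
```

Both metrics rank "Customer Transactions" highest, "Balance Accounts" lower, and "Other Category" at zero, which is exactly the ordering the question asks for.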
As an example:
From your question:
Customer Transactions: deposits, deposit, customer, account, accounts
So you could construct a vector with 5 elements: (1, 1, 1, 1, 1). This means that the "Customer Transactions" category has 5 keywords, and (this will sound obvious, but) each of those words is present in its own keyword list. Stay with me.
So now you take your sentence:
The system shall apply deposits to a customer's specified account.
This has 3 words from the "Customer Transactions" set: {deposits, account, customer}
(Actually, this illustrates another nuance: your sentence actually has "customer's". Is that equivalent to "customer"?)
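One common way to resolve that nuance (a sketch of my own, not part of the original walkthrough) is to normalize both the sentence words and the keywords before comparing, e.g. with NLTK's Porter stemmer:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def normalize(word):
    # Crude possessive handling ("customer's" -> "customer"), then stemming
    # so singular/plural forms ("deposit"/"deposits") collapse together.
    return stemmer.stem(word.lower().removesuffix("'s"))

print(normalize("customer's") == normalize("customer"))  # True
print(normalize("deposits") == normalize("deposit"))     # True
```

If you run the keyword lists through the same normalization, "customer's" in the sentence will match the keyword "customer".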
The vector for your sentence might be (1, 0, 1, 1, 0)
The 1's in this vector are in the same position as the 1's in the first vector - because those words are the same.
So we could ask: how many times do these vectors differ? Let's compare:
(1,1,1,1,1)
(1,0,1,1,0)
Hm. They have the same "bit" 3 times - in the 1st, 3rd, and 4th positions. They differ in only 2 bits. So let's say that when we compare these two vectors, we have a "distance" of 2. Congrats, we just computed the Hamming distance! The lower your Hamming distance, the more "similar" the data.
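That walkthrough translates into a few lines of Python (a sketch of my own, treating "customer's" as "customer" per the nuance above):

```python
keywords = ["deposits", "deposit", "customer", "account", "accounts"]
sentence_words = {"the", "system", "shall", "apply", "deposits",
                  "to", "a", "customer", "specified", "account"}

category_vec = [1] * len(keywords)            # (1, 1, 1, 1, 1)
sentence_vec = [1 if k in sentence_words else 0
                for k in keywords]            # (1, 0, 1, 1, 0)

# Hamming distance: count the positions where the two vectors differ.
distance = sum(a != b for a, b in zip(category_vec, sentence_vec))
print(distance)  # 2
```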
(The difference between a "similarity" measure and a "distance" measure is that the former is normalized - it gives you a value between 0 and 1. A distance is just any number, so it only gives you a relative value.)
Anyway, this might not be the best way to do natural language processing, but for your purposes it is the simplest and might actually work pretty well for your application, or at least as a starting point.
(PS: "classification" - as you have in your title - would be answering the question "If you take my sentence, which category is it most likely to fall into?" Which is a bit different than saying "how much more similar is my sentence to category 1 than category 2?" which seems to be what you're after.)
good luck!
The main characteristics of the problem are:
- the items to classify are very short (a single sentence each)
- the categories are defined by small, predefined lists of keywords
- the quantities involved are all small (few categories, few keywords, few words per item)
These characteristics bring both good and bad news: the implementation should be relatively straightforward, but a consistent level of accuracy in the categorization process may be hard to achieve. Also, the small size of the various quantities (number of possible categories, max/average number of words in an item, etc.) gives us room to select solutions that may be CPU- and/or space-intensive, if need be.
Yet, even with this license to "go fancy", I suggest starting with (and staying close to) a simple algorithm, and expanding on this basis with a few additions and considerations, while remaining vigilant of the ever-present danger called overfitting.
Basic algorithm (conceptual, i.e. no focus on performance tricks at this time)
Simple but plausible: we favor the categories that have the most matches, but we divide by the overall number of matches, as a way of lessening the confidence rating when many words were found. Note that this division does not affect the relative ranking of the category selection for a given item, but it may be significant when comparing the ratings of different items.
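A minimal sketch of that rating rule (my own code, following the description above, with the word sets from the question):

```python
def rate(item_words, categories):
    # matches[cat] = how many of the category's keywords appear in the item.
    matches = {cat: len(item_words & kws) for cat, kws in categories.items()}
    total = sum(matches.values())
    # Divide by the item's total match count; this leaves the relative
    # ranking of categories unchanged for this item, but scales down the
    # confidence when the item matches many keywords across categories.
    return {cat: (m / total if total else 0.0) for cat, m in matches.items()}

item = {"the", "system", "shall", "apply", "deposits",
        "to", "a", "customer", "specified", "account"}
categories = {
    "Customer Transactions": {"deposits", "deposit", "customer", "account", "accounts"},
    "Balance Accounts": {"account", "accounts", "debits", "credits"},
    "Other Category": {"foo", "bar"},
}
print(rate(item, categories))
# {'Customer Transactions': 0.75, 'Balance Accounts': 0.25, 'Other Category': 0.0}
```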
Now, several simple improvements come to mind: (I'd seriously consider the first two, and give thought to the others; the decision on each of these is very much tied to the scope of the project, the statistical profile of the data to be categorized, and other factors...)
Also, aside from the calculation of the rating per se, we should also consider:
The question of metrics should be considered early, but this also requires a reference set of input items: a "training set" of sorts, even though we are working off a predefined dictionary of category keywords (typically, training sets are used to determine this very list of category keywords, along with a weight factor for each). Of course, such a reference/training set should be both statistically significant and statistically representative [of the whole set].
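As a sketch of what such a measurement could look like (the labeled items below are invented placeholders, and `rate` is the function from the earlier sketch): score every item in the reference set and check how often the top-rated category agrees with its label:

```python
def accuracy(reference, categories):
    # reference: list of (item_word_set, expected_category) pairs.
    hits = 0
    for item_words, expected in reference:
        ratings = rate(item_words, categories)  # from the earlier sketch
        best = max(ratings, key=ratings.get)
        hits += (best == expected)
    return hits / len(reference)

# Hypothetical labeled items, for illustration only.
reference = [
    ({"deposits", "customer", "account"}, "Customer Transactions"),
    ({"debits", "credits", "account"}, "Balance Accounts"),
]
print(accuracy(reference, categories))  # 1.0 on this toy set
```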
To summarize: stick to simple approaches; the context doesn't leave room to be very fancy anyway. Consider introducing a way of measuring the efficiency of particular algorithms (or of particular parameters within a given algorithm), but beware that such metrics may be flawed and prompt you to specialize the solution for a given set to the detriment of the other items (overfitting).
I was also facing the same issue of creating a classifier based only on keywords. I had a class-keywords mapper file, which contained a class variable and the list of keywords occurring in that particular class. I came up with the following algorithm, and it works really well.
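A minimal sketch of a classifier along those lines (the mapper-file format and the scoring rule here are my assumptions, not necessarily the author's original code):

```python
import json

def load_mapper(path):
    # Assumed format: a JSON file mapping class name -> list of keywords,
    # e.g. {"Customer Transactions": ["deposit", "customer", ...], ...}
    with open(path) as f:
        return {cls: set(words) for cls, words in json.load(f).items()}

def classify(text, mapper):
    words = set(text.lower().split())
    # Score each class by keyword overlap and return them ranked, best first.
    scores = {cls: len(words & kws) for cls, kws in mapper.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# mapper = load_mapper("class_keywords.json")  # hypothetical file name
# print(classify("the system shall apply deposits to an account", mapper))
```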