Text classification
I am working on a text classification problem: I want to classify a collection of words into categories. Yes, there are plenty of libraries available for classification, so please don't answer if you are only going to suggest using one of them.
Let me explain what I want to implement, using an example.
List of Words:
- java
- programming
- language
- c-sharp
List of Categories:
- java
- c-sharp
Here we train the set as follows:
- java maps to category 1 (java)
- programming maps to category 1 (java)
- programming maps to category 2 (c-sharp)
- language maps to category 1 (java)
- language maps to category 2 (c-sharp)
- c-sharp maps to category 2 (c-sharp)
Now we have the phrase "The best java programming book".
From this phrase, the following words match our "List of Words":
- java
- programming
"programming" is mapped to two categories, "java" and "c-sharp", so it is a common word.
"java" is mapped to the category "java" only.
So the matching category for the phrase is "java".
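To make the idea concrete, here is a minimal Python sketch of the lookup described above. The names `WORD_CATEGORIES` and `classify` are made up for illustration, and the whitespace tokenization and the rule "skip any word that maps to more than one category" are assumptions about how the matching would work, not a finished design.

```python
from collections import Counter

# Training data from the example: each known word maps to one or more categories.
WORD_CATEGORIES = {
    "java": {"java"},
    "programming": {"java", "c-sharp"},
    "language": {"java", "c-sharp"},
    "c-sharp": {"c-sharp"},
}


def classify(phrase, word_categories=WORD_CATEGORIES):
    """Return the category with the most votes, or None if no word is decisive."""
    votes = Counter()
    for token in phrase.lower().split():
        categories = word_categories.get(token)
        if categories is None:
            continue  # word is not in the list of words
        if len(categories) > 1:
            continue  # common word shared by several categories: no vote
        votes.update(categories)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


print(classify("The best java programming book"))  # java
```

A phrase that contains only common words, such as "a programming language book", yields `None` here, and a phrase with equal votes for two categories is decided by whichever category was counted first. Those two cases are where the approach seems weakest.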
This is what came to my mind. Is this solution fine? Can it be implemented? What are your suggestions, and is there anything I am missing (flaws, etc.)?
Comments (3)
Of course this can be implemented. If you train a Naive Bayes classifier or linear SVM on the right dataset (titles of Java and C# programming books, I guess), it should learn to associate the term "Java" with Java, "C#" and ".NET" with C#, and "programming" with both. I.e., a Naive Bayes classifier would likely learn a roughly even probability of Java or C# for common terms like "programming" if the dataset is divided evenly.
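Since the question rules out libraries, here is a rough from-scratch sketch of a multinomial Naive Bayes classifier with add-one smoothing. The class name `NaiveBayes`, the whitespace tokenizer, and the four training titles are all invented for illustration; a real dataset would be much larger.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of training texts
        self.vocab = set()

    def train(self, text, category):
        tokens = text.lower().split()
        self.word_counts[category].update(tokens)
        self.doc_counts[category] += 1
        self.vocab.update(tokens)

    def log_scores(self, text):
        """Return log P(category) + sum of log P(word | category) per category."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)
            for token in text.lower().split():
                if token not in self.vocab:
                    continue  # ignore words never seen in training
                score += math.log((counts[token] + 1) / (total_words + len(self.vocab)))
            scores[category] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical, evenly divided training set.
nb = NaiveBayes()
nb.train("java programming language", "java")
nb.train("learning java programming", "java")
nb.train("c# programming language", "c-sharp")
nb.train("learning c# programming", "c-sharp")

print(nb.classify("The best java programming book"))  # java
```

With this balanced toy data, `nb.log_scores("programming")` gives the same score for both categories, which is the "roughly even probability" behaviour described above.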
A dirt-simple way to implement this is to use straight-up Lucene (or any text-indexing engine). Create a single Lucene document containing all of the "java" examples and another document containing the "c#" examples, and add both to the index. To classify a new document, OR together all the terms in the document, execute that query against the index, and grab the category with the highest score.
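The same idea can be sketched without Lucene. The Python below is only a stand-in: `build_index` and `classify` are hypothetical helpers, the example texts are made up, and the TF-IDF-style score is a simplification, not Lucene's actual scoring formula.

```python
import math
from collections import Counter


def build_index(category_examples):
    """Build one 'document' per category: a bag of all terms in its examples."""
    return {
        category: Counter(" ".join(examples).lower().split())
        for category, examples in category_examples.items()
    }


def classify(index, text):
    """OR the query terms together and return (best_category, scores)."""
    n_docs = len(index)
    scores = {}
    for category, term_counts in index.items():
        score = 0.0
        for term in set(text.lower().split()):
            tf = term_counts[term]
            if tf == 0:
                continue  # an OR query simply skips terms this document lacks
            # Terms that occur in fewer category documents weigh more.
            df = sum(1 for counts in index.values() if counts[term] > 0)
            score += math.sqrt(tf) * math.log(1 + n_docs / df)
        scores[category] = score
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else None), scores


# Hypothetical example texts, one list per category.
index = build_index({
    "java": ["java programming language", "learning java programming"],
    "c-sharp": ["c# programming language", "learning c# programming"],
})

best, scores = classify(index, "The best java programming book")
print(best)  # java
```

Because "programming" occurs in both category documents, it adds the same amount to each score, so the rarer term "java" decides the result.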
If possible, read the section called "A Naive Classifier" in the chapter "Document Filtering" of the book "Programming Collective Intelligence". Although the examples are in Python, I hope that will not be much trouble for you.