I have n documents and want to find the words they have in common.
For example, I want to be able to say that (n-3) documents contain the word "web".
I can certainly do this with basic data structures, but there may be a more efficient algorithm, or a way to treat the same word with different suffixes as a single word.
Is there an algorithm for this purpose?
I am unfamiliar with the data mining world. Is there a general term for finding similarities between different documents? If there is, it will make my research much easier.
Thanks.
Comments (2)
I suppose that you are talking about stemming. If you want to use the R language, you'll have to work with the tm package. If not, I can only suggest this list of text mining tools.
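To illustrate what stemming buys you here, below is a minimal Python sketch that counts how many documents contain each word after suffixes are stripped. The `crude_stem` function and its `SUFFIXES` list are hypothetical stand-ins made up for this example, not a real stemming algorithm; in practice you would swap in a proper stemmer from a text mining library.

```python
import re
from collections import Counter

# Hypothetical, deliberately crude suffix list (an assumption for this sketch).
SUFFIXES = ("ing", "es", "ed", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def document_frequency(documents):
    """Map each stem to the number of documents that contain it."""
    df = Counter()
    for text in documents:
        words = re.findall(r"[a-z]+", text.lower())
        # A set ensures each stem is counted once per document.
        df.update({crude_stem(w) for w in words})
    return df


docs = [
    "The web is large",
    "Webs of links",
    "Crawling the web",
    "Nothing relevant here",
]
df = document_frequency(docs)
print(df["web"])  # 3 of the 4 documents contain "web" or "webs"
```

With n documents, `df[stem] == n - 3` is then the check for "(n-3) documents include this word".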
You can do it by producing a word list with counts for each document, sorting each word list alphabetically, and comparing the two lists. This is O(n lg n).
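A rough Python sketch of that approach for two documents follows. The function names are my own, and the merge-style comparison is one assumed way to "compare two lists" once they are sorted; the sort is what gives the O(n lg n) cost.

```python
import re
from collections import Counter


def sorted_word_counts(text):
    """Return (word, count) pairs for one document, sorted alphabetically."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return sorted(counts.items())


def common_words(list_a, list_b):
    """Walk two sorted word lists in step and collect the shared words."""
    i = j = 0
    shared = []
    while i < len(list_a) and j < len(list_b):
        word_a, count_a = list_a[i]
        word_b, count_b = list_b[j]
        if word_a == word_b:
            shared.append((word_a, count_a, count_b))
            i += 1
            j += 1
        elif word_a < word_b:
            i += 1  # word_a cannot appear later in list_b
        else:
            j += 1  # word_b cannot appear later in list_a
    return shared


a = sorted_word_counts("the web and the web server")
b = sorted_word_counts("a web page on the web")
print(common_words(a, b))  # [('the', 2, 1), ('web', 2, 2)]
```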
Another approach is to use the full-text search provided by your database of choice.