A list of English "stop words"?

Posted on 2024-07-30 08:26:06

Comments (6)

奈何桥上唱咆哮 2024-08-06 08:26:06

The magic word to put into Google is "stop words". This turns up a reasonable-looking list.

MySQL also has a built-in list of stop words, but it is far too comprehensive for my taste. For example, at our university library we had problems because "third" in "third world" was treated as a stop word.

不气馁 2024-08-06 08:26:06

These are called stop words; check this sample.

拒绝两难 2024-08-06 08:26:06

Depending on the subdomain of English you are working in, you may have to (or wish to) compile your own stop word list. Some generic stop words could be meaningful in a domain. E.g. the word "are" could actually be an abbreviation/acronym in some domain. Conversely, depending on your application, you may want to ignore some domain-specific words that you would not want to ignore in general English. E.g. if you are analyzing a corpus of hospital reports, you may wish to ignore words like 'history' and 'symptoms', as they would be found in every report and may not be useful (from a plain vanilla inverted index perspective).

Otherwise, the lists returned by Google should be fine. The Porter Stemmer uses this and the Lucene search engine implementation uses this.
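
The domain adjustment described above can be sketched in Python. This is only an illustration: `GENERIC_STOP_WORDS` is a tiny made-up stand-in for whichever published list you start from, and `build_stop_words` and `filter_tokens` are hypothetical helper names, not part of any library.

```python
# Tiny stand-in for a published generic list (assumption, for illustration only).
GENERIC_STOP_WORDS = {"a", "an", "and", "are", "in", "of", "the", "to", "with"}


def build_stop_words(generic, keep=(), extra=()):
    """Return the generic list minus words to keep, plus domain-specific extras."""
    return (set(generic) - set(keep)) | set(extra)


def filter_tokens(text, stop_words):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


# Hospital-report example from the answer: "are" is kept because it might be
# an acronym in the domain; "history" and "symptoms" are added because they
# appear in every report.
hospital_stop_words = build_stop_words(
    GENERIC_STOP_WORDS,
    keep={"are"},
    extra={"history", "symptoms"},
)

print(filter_tokens("patient history and symptoms are consistent with asthma",
                    hospital_stop_words))
```

Keeping the generic list untouched and deriving a per-domain set from it makes it easy to maintain several domains side by side.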

困倦 2024-08-06 08:26:06

Get statistics about word frequency in a large text corpus. Ignore all words whose frequency is greater than some threshold.
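
A minimal Python sketch of that idea, assuming a simple regex tokenizer and a hand-picked threshold. The function name `frequent_words` and the parameter `max_count` are illustrative choices, not something from the answer.

```python
import re
from collections import Counter


def frequent_words(documents, max_count):
    """Return every word that occurs more than max_count times in total."""
    counts = Counter()
    for doc in documents:
        # Naive tokenizer (assumption): runs of ASCII letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", doc.lower()))
    return {word for word, n in counts.items() if n > max_count}


docs = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "a cat and a dog",
]
print(frequent_words(docs, 3))
```

In practice the threshold would probably be tuned per corpus, or expressed as a fraction of the total token count rather than as an absolute number.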

只想待在家 2024-08-06 08:26:06

I think I used the stopword list for German from here when I built a search application with lucene.net a while ago. The site contains a list for English, too, and the lists on the site are apparently the ones that the Lucene project uses as defaults.

十年九夏 2024-08-06 08:26:06

Typically these words will appear in documents with the highest frequency.
Assuming you have a global list of words:

{ Word Count }

With the list of words, if you ordered the words from the highest count to the lowest, you would have a graph (count on the y axis, word on the x axis) that looks like an inverse log function. All of the stop words would be at the left, and the cutoff for the "stop words" would be where the first derivative is largest in magnitude, that is, where the count drops most steeply.
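
One possible reading of this cutoff, sketched in Python: rank the words by count, take the difference between neighbouring counts as a discrete first derivative, and cut at the largest drop. The function name `stop_words_by_steepest_drop` and the sample counts are made up for illustration. On real corpora the largest absolute drop may well fall very near the top of the ranking, so smoothing or a log scale might be needed.

```python
from collections import Counter


def stop_words_by_steepest_drop(counts):
    """Return the words ranked above the largest drop between adjacent counts."""
    # Sort by descending count; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    if len(ranked) < 2:
        return set()
    # Discrete first derivative: count at rank i minus count at rank i + 1.
    drops = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    # Index of the steepest drop (the first one wins on ties).
    cut = max(range(len(drops)), key=drops.__getitem__)
    return {word for word, _ in ranked[: cut + 1]}


# Made-up global { Word Count } list.
word_counts = Counter(
    {"the": 100, "of": 95, "and": 90, "patient": 12, "fever": 10, "x": 3}
)
print(stop_words_by_steepest_drop(word_counts))
```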

This approach is better than a dictionary-based one in several ways:

  • It is a universal approach that is not bound to any language
  • It learns from the data which words count as "stop words"
  • It produces better results for collections of very similar documents, and yields distinctive word listings for the items in those collections
  • The stop words can be recalculated later (the result can be cached, and a statistical check can determine whether the stop words have changed since they were last calculated)
  • It can also eliminate time-bound or informal words and names (such as slang, or a company name that appears as a header in a batch of documents)

The dictionary-based approach is better in other ways:

  • Lookup is much faster
  • The results are precached
  • It's simple
  • Someone else has already come up with the stop words