Simple filtering out of common words from a text description
I want to filter out words like "a", "the", "best", "kind". I am pretty sure there are good ways of achieving this.
Just to be clear, I am looking for:
- The simplest solution that can be implemented, preferably in Ruby.
- I have a high level of tolerance for errors.
- If a library of common phrases is what I need, I am perfectly happy with that too.
Comments (5)
These common words are known as "stop words". There is a similar Stack Overflow question about this: "Stop words" list for English?
To summarize:
If you just put these words into a hash in your program, it should be easy to filter any list of words.
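As a rough illustration of the hash idea, here is a minimal sketch in Ruby. The word list, the constant `STOP_WORDS`, and the method name `filter_stop_words` are made up for this example; a `Set` is used because it is backed by a hash and gives the same fast lookup. A real stop-word list would be much longer.

```ruby
require 'set'

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = Set.new(%w[a an the and or of to in is best kind])

# Returns the words of +text+ that are not stop words.
# Words are compared case-insensitively; punctuation is discarded by the scan.
def filter_stop_words(text, stop_words = STOP_WORDS)
  text.scan(/\w+/).reject { |word| stop_words.include?(word.downcase) }
end

puts filter_stop_words("The best kind of coffee is a fresh one").join(" ")
# prints "coffee fresh one"
```

Given the stated tolerance for errors, a plain `\w+` scan is probably good enough here, although it splits contractions such as "don't" into two tokens.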
This is a variation on DigitalRoss's answer.
Also relevant:
What's the fastest way to check if a word from one string is in another string?
Hold on, you need to do some research before you take out stopwords (aka noise words, junk words). Index size and processing resources aren't the only issues. A lot depends on whether end-users will be typing queries, or you will be working with long automated queries.
All search log analysis shows that people tend to type one to three words per query. When that's all a search has to work with, we can't afford to lose anything. For example, a collection might have the word "copyright" on many documents, making it very common, but if that word is not in the index, it's impossible to do exact phrase searches or proximity relevance ranking. In addition, there are perfectly legitimate reasons to search for the most common words: people may be looking for "The Who", or worse, "The The".
So while there are technical issues to consider, and taking out stopwords is one solution, it may not be the right solution for the overall problem that you are trying to solve.
If you have an array of words to remove named stop_words, then you get the result from this expression:
If you want to preserve the non-word characters between each word,
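
The expressions themselves are not present in this copy of the answer. The following is only a sketch of what the two cases might look like in Ruby, not the answerer's original code; the `stop_words` contents and the sample `description` are assumptions made for illustration.

```ruby
# Hypothetical inputs for this sketch.
stop_words  = %w[a the best kind]
description = "The best kind of tea, a classic!"

# Case 1: keep only the words that are not stop words, rejoined with
# single spaces. Punctuation between words is lost.
words_only = description.scan(/\w+/)
                        .reject { |w| stop_words.include?(w.downcase) }
                        .join(" ")

# Case 2: blank out stop words in place, so the non-word characters
# between words (spaces, commas, and so on) are preserved.
preserved = description.gsub(/\w+/) { |w| stop_words.include?(w.downcase) ? "" : w }

puts words_only  # prints "of tea classic"
puts preserved
```

The second form leaves the original spacing behind where words were removed, so a follow-up such as `preserved.squeeze(" ").strip` may be wanted to tidy the result.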