
Simple filtering of common words from a text description

Posted 2024-10-11 19:53:34, 177 words, 14 views, 0 comments

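I want to filter out words like "a", "the", "best", and "kind". I'm pretty sure there are good ways to achieve this.

Just to be clear, I'm looking for:

The simplest solution to implement, preferably in Ruby.
I have a high tolerance for errors.
If I need a library of common phrases, I'm perfectly happy with that too.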


遗心遗梦遗幸福 2024-10-18 19:53:34

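These common words are known as "stop words", and there is a similar Stack Overflow question about them.

To summarize:

If you have a large amount of text to process, it is worth gathering statistics on word frequency in that particular data set and taking the most frequent words as your stop word list. (That you include "kind" in your examples suggests you may have a rather unusual data set, for example one with many colloquial expressions like "kind of", so you may need to do this.)
Since you say you don't mind errors much, it may be enough to use an English stop word list that someone else has produced, for example the fairly long one used by MySQL.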
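
The first suggestion can be sketched roughly as follows: count how often each word occurs in your own text and treat the most frequent ones as stop words. The helper name `top_words`, the sample `corpus`, and the cutoff of 3 are all assumptions made for illustration; a real cutoff would need tuning against your data.

```ruby
# Hypothetical helper: returns the `limit` most frequent words in `text`.
def top_words(text, limit)
  counts = Hash.new(0)
  # Lowercase first so "The" and "the" are counted together.
  text.downcase.scan(/\w+/) { |word| counts[word] += 1 }
  # Sort by descending count, breaking ties alphabetically for stable output.
  counts.sort_by { |word, count| [-count, word] }.first(limit).map(&:first)
end

# Made-up sample text standing in for your data set.
corpus = "It is kind of the best, and the rest is kind of the worst."
stop_words = top_words(corpus, 3)
puts stop_words.inspect
```

On a corpus this small the result is not meaningful; the point is only that the list comes from your data rather than from a generic source.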

If you just put these words into a hash in your program, filtering any list of words should be easy.
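
A minimal sketch of that hash lookup might look like this. The short `STOP_WORDS` list and the `remove_stop_words` name are placeholders chosen for this example, not a recommended list; in practice you would load a longer published list.

```ruby
# Placeholder list; substitute a fuller stop word list in practice.
STOP_WORDS = %w[a an and the of to is in best kind]
  .each_with_object({}) { |word, hash| hash[word] = true }

# Hypothetical helper: drops every word found in STOP_WORDS, ignoring case.
def remove_stop_words(words)
  words.reject { |word| STOP_WORDS[word.downcase] }
end

words = "The best kind of filter is a simple one".split
filtered = remove_stop_words(words)
puts filtered.join(' ')
```

A hash lookup takes roughly constant time per word, so the cost of filtering grows with the length of the text rather than with the size of the stop word list.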


蓝梦月影 2024-10-18 19:53:34

This is a variation on DigitalRoss's answer.

str = <<EOF
To be, or not to be: that is the question:
  Whether 'tis nobler in the mind to suffer
  The slings and arrows of outrageous fortune,
  Or to take arms against a sea of troubles,
  And by opposing end them? To die: to sleep;
  No more; and by a sleep to say we end
  The heart-ache and the thousand natural shocks
  That flesh is heir to, 'tis a consummation
  Devoutly to be wish'd. To die, to sleep;
  To sleep: perchance to dream: ay, there's the rub;
  For in that sleep of death what dreams may come
EOF

# Build a hash of common words for fast lookup.
common = {}
%w{ a and or to the is in be }.each { |w| common[w] = true }
# Blank out each common word (ignoring case), then collapse runs of spaces.
puts str.gsub(/\b\w+\b/) { |word| common[word.downcase] ? '' : word }.squeeze(' ')

Also relevant:
What's the fastest way to check if a word from one string is in another string?


用心笑 2024-10-18 19:53:34

# Words to filter out.
Common = %w{ a and or to the is in be }
# Split the text at word boundaries so punctuation and spacing are kept as separate tokens.
Uncommon = %{
  To be, or not to be: that is the question:
  Whether 'tis nobler in the mind to suffer
  The slings and arrows of outrageous fortune,
  Or to take arms against a sea of troubles,
  And by opposing end them? To die: to sleep;
  No more; and by a sleep to say we end
  The heart-ache and the thousand natural shocks
  That flesh is heir to, 'tis a consummation
  Devoutly to be wish'd. To die, to sleep;
  To sleep: perchance to dream: ay, there's the rub;
  For in that sleep of death what dreams may come
}.split(/\b/)
ignore_me, result = {}, []
# Index the common words in a hash for fast lookup.
Common.each { |w| ignore_me[w.downcase] = :Common }
# Keep every token whose leading word characters are not a common word.
Uncommon.each { |w| result << w unless ignore_me[w.downcase[/\w*/]] }
puts result.join

Output:

 ,  not  : that   question: 
Whether 'tis nobler   mind  suffer
 slings  arrows of outrageous fortune,
  take arms against  sea of troubles,
 by opposing end them?  die:  sleep;
No more;  by  sleep  say we end
 heart-ache   thousand natural shocks
That flesh  heir , 'tis  consummation
Devoutly   wish'd.  die,  sleep;
 sleep: perchance  dream: ay, there's  rub;
For  that sleep of death what dreams may come


年少掌心 2024-10-18 19:53:34

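Hold on: before you remove stop words (also known as noise words or junk words), you need to do some research. Index size and processing resources are not the only issues. A lot depends on whether end users will be typing queries or whether you will be handling long, automated queries.

Search log analyses consistently show that people tend to type one to three words per query. When that is all a search has to work with, we cannot afford to lose anything. For example, a collection might contain the word "copyright" in many documents, which makes it very common, but if the word is not in the index, exact phrase searches and proximity-based relevance ranking become impossible. There are also perfectly legitimate reasons to search for the most common words: people may be looking for "The Who", or worse, "The The".

So while there are technical issues to consider, and removing stop words is one solution, it may not be the right solution for the overall problem you are trying to solve.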
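
To make the concern concrete, here is a small sketch of what happens to short queries once stop words are stripped. The stop word list, the `strip_stop_words` helper, and the sample queries are assumptions for illustration only and are not taken from any particular search engine.

```ruby
require 'set'

# Illustrative stop word list (an assumption for this example).
QUERY_STOP_WORDS = Set.new(%w[a and or to the is in be])

# Hypothetical helper: removes stop words from a query, ignoring case.
def strip_stop_words(query)
  query.split.reject { |word| QUERY_STOP_WORDS.include?(word.downcase) }.join(' ')
end

# Print each query next to what is left after stripping.
["The Who", "The The", "copyright law"].each do |query|
  puts "#{query.inspect} -> #{strip_stop_words(query).inspect}"
end
```

With this list, "The Who" loses its first word and "The The" is left with nothing to search for.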


(り薆情海 2024-10-18 19:53:34

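If you have an array of words to remove named stop_words, then this expression gives you the result:

description.scan(/\w+/).reject { |word| stop_words.include?(word) }.join(' ')

If you want to preserve the non-word characters that follow each kept word:

# \W* (rather than \W+) also matches a final word with nothing after it.
description.scan(/(\w+)(\W*)/).reject do |(word, other)|
  stop_words.include?(word)
end.flatten.join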
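
Note that `include?` on an array is case-sensitive and scans the whole array for every word. One possible variation, sketched under the assumption that `stop_words` holds only lowercase entries, uses a `Set` and downcases each word before the lookup. The sample `description` and word list are made up for the example.

```ruby
require 'set'

# Assumed inputs for illustration.
stop_words = Set.new(%w[a the best kind of])
description = "The best kind of coffee, a dark roast"

# Compare in lowercase so "The" is dropped along with "the".
kept = description.scan(/\w+/).reject { |word| stop_words.include?(word.downcase) }
result = kept.join(' ')
puts result
```

The original casing of the kept words is untouched because only the lookup key is downcased.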



About the author: 病女
