How do I clean up and improve a keyword list?

Posted 2024-10-03 19:16:02

Comments (2)

单挑你×的.吻 2024-10-10 19:16:02

You're talking about "stopwords": articles such as "the" and "a", plus other words that turn up so often that they're worthless as keywords.

Stopword lists exist; WordNet has one if I remember right, and there might be one in Lingua, the Ruby WordNet bindings, or the readability modules, but really they're pretty easy to generate yourself. And you probably need to, since the junk words vary depending on the subject matter.

The easiest thing to do is run a preliminary pass over several sample documents: split the text into words, loop over them, and increment a counter for each one. When you're finished, look for the words that are two to four letters long and have disproportionately high counts. Those are good candidates for stopwords.

Then run passes over your target documents, splitting the text like you did previously, counting occurrences as you go. You can either ignore words in your stopword list and not add them to your hash, or process everything then delete the stopwords.

text = <<EOT
You have reached this web page by typing "example.com", "example.net","example.org"
or "example.edu" into your web browser.

These domain names are reserved for use in documentation and are not available
for registration. See RFC 2606, Section 3.
EOT

# do this against several documents to build a stopword list. Tweak as necessary to fine-tune the words.
stopwords = text.downcase.split(/\W+/).inject(Hash.new(0)) { |h,w| h[w] += 1; h }.select{ |n,v| n.length < 5 }

print "Stopwords => ", stopwords.keys.sort.join(', '), "\n"

# >> Stopwords => 2606, 3, and, are, by, com, edu, for, have, in, into, net, not, or, org, page, rfc, see, this, use, web, you, your
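
The snippet above only filters on word length against a single text. A rough sketch of the fuller idea (several sample documents, short words, and a minimum count) might look like this. The `stopword_candidates` name, the `min_count` threshold, and the sample sentences are all made up for illustration, and a fixed threshold is just a stand-in for "disproportionately high":

```ruby
# Hypothetical helper: count words across several sample documents and keep
# the short ones (2 to 4 characters by default) seen at least min_count times.
def stopword_candidates(documents, min_count: 2, length_range: 2..4)
  counts = Hash.new(0)
  documents.each do |doc|
    # scan(/\w+/) yields each run of word characters, so no empty strings appear
    doc.downcase.scan(/\w+/) { |word| counts[word] += 1 }
  end

  counts
    .select { |word, count| length_range.cover?(word.length) && count >= min_count }
    .keys
    .sort
end

# Made-up sample documents, not from the original answer
samples = [
  "These domain names are reserved for use in documentation.",
  "The names are not available for registration or for sale.",
]

puts stopword_candidates(samples).join(', ')
# >> are, for
```

You'd still want to eyeball the result and tweak it by hand, since a short, frequent word can be a perfectly good keyword in some subject areas.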

Then, you're ready to do some keyword gathering:

text = <<EOT
You have reached this web page by typing "example.com", "example.net","example.org"
or "example.edu" into your web browser.

These domain names are reserved for use in documentation and are not available
for registration. See RFC 2606, Section 3.
EOT

stopwords = %w[2606 3 and are by com edu for have in into net not or org page rfc see this use web you your]

keywords = text.downcase.split(/\W+/).inject(Hash.new(0)) { |h,w| h[w] += 1; h }
stopwords.each { |s| keywords.delete(s) }

# output in order of most often seen to least often seen.
keywords.keys.sort{ |a,b| keywords[b] <=> keywords[a] }.each { |k| puts "#{k} => #{keywords[k]}"}
# >> example => 4
# >> names => 1
# >> reached => 1
# >> browser => 1
# >> these => 1
# >> domain => 1
# >> typing => 1
# >> reserved => 1
# >> documentation => 1
# >> available => 1
# >> registration => 1
# >> section => 1
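
The code above takes the "process everything, then delete the stopwords" route. The other option mentioned earlier, skipping stopwords while counting, could be sketched roughly like this. The `keyword_counts` name and the sample sentence are just for illustration:

```ruby
require 'set'

# Hypothetical helper: count keywords, skipping stopwords during the pass
# instead of deleting them from the hash afterwards.
def keyword_counts(text, stopwords)
  skip = stopwords.to_set # Set lookup avoids rescanning an array for every word
  text.downcase.scan(/\w+/).each_with_object(Hash.new(0)) do |word, counts|
    counts[word] += 1 unless skip.include?(word)
  end
end

counts = keyword_counts('Example domains: "example.com" and "example.net".', %w[and com net])

# Sort by count (descending), then alphabetically so ties come out in a stable order
counts.sort_by { |word, count| [-count, word] }.each { |word, count| puts "#{word} => #{count}" }
# >> example => 3
# >> domains => 1
```

Sorting on `[-count, word]` is a small extra: a plain `sort` on counts alone leaves the order of tied words unspecified.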

After you've narrowed down your list of words, you can run the candidates through WordNet to find synonyms, homonyms, and word relations, strip plurals, and so on. If you're doing this to a whole lot of text, you'll want to keep your stopwords in a database where you can continually fine-tune them. The same thing applies to your keywords, because from those you can start to determine tone and other semantic goodness.
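
As a very small stand-in for "keep your stopwords in a database", here is a sketch that persists the list to a plain text file, one word per line, so it can be tuned between runs. The file name, the file format, and the `load_stopwords` / `save_stopwords` helpers are assumptions for illustration only; a real setup would more likely use a proper database table:

```ruby
require 'set'
require 'tmpdir'

# Hypothetical helper: read one stopword per line; a missing file means an empty set.
def load_stopwords(path)
  return Set.new unless File.exist?(path)
  File.readlines(path, chomp: true).map(&:strip).reject(&:empty?).to_set
end

# Hypothetical helper: write the set back out, sorted, one word per line.
def save_stopwords(path, stopwords)
  File.write(path, stopwords.to_a.sort.join("\n") + "\n")
end

# Demo path in the system temp directory (an assumption, pick your own location)
path = File.join(Dir.tmpdir, 'stopwords_demo.txt')

stopwords = load_stopwords(path)
stopwords.merge(%w[and are for web]) # add newly spotted junk words
stopwords.delete('web')              # drop a word that turned out to matter
save_stopwords(path, stopwords)

puts load_stopwords(path).include?('and')
```

The same load/tweak/save loop would work for the keyword counts if you serialize the hash instead of a set.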

洛阳烟雨空心柳 2024-10-10 19:16:02

Btw, I decided to go this route:

bad_words = ["the", "a", "for", "on"] #etc etc
# Strip non-alphanumeric chars, split into a temp array, then cut out the bad words
# (the match is case-sensitive, so "The" survives unless str is downcased first)
tmp_str = str.gsub(/[^A-Za-z0-9\s]/, "").split - bad_words
str = tmp_str.join(" ")
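
For anyone who wants to try the snippet above as-is, a self-contained variant might look like the following. The `strip_bad_words` method name and the sample sentence are made up, and the case-insensitive comparison is an addition that the original snippet doesn't have:

```ruby
# Hypothetical wrapper around the snippet above, with a case-insensitive match
def strip_bad_words(str, bad_words)
  # Drop everything except letters, digits, and whitespace, then split on whitespace
  words = str.gsub(/[^A-Za-z0-9\s]/, "").split
  # Compare in lowercase so "The" is removed as well as "the"
  words.reject { |w| bad_words.include?(w.downcase) }.join(" ")
end

puts strip_bad_words("The quick fox, on a log!", %w[the a for on])
# >> quick fox log
```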