Extracting the individual real words in a domain name

Posted 2024-10-05 08:55:59

I'm looking for a Ruby gem (preferably) that will cut domain names up into their words.

whatwomenwant.com => 3 words, "what", "women", "want".

If it can ignore things like numbers and gibberish then great.


3 Answers

你不是我要的菜∠ 2024-10-12 08:55:59

You'll need a word list such as those produced by Project Gutenberg or available in the source for ispell &c. Then you can use the following code to decompose a domain into words:

WORD_LIST = [
  'experts',
  'expert',
  'exchange',
  'sex',
  'change',
]

def words_that_phrase_begins_with(phrase)
  WORD_LIST.find_all do |word|
    phrase.start_with?(word)
  end
end

def phrase_to_words(phrase, words = [], word_list = [])
  if phrase.empty?
    word_list << words
  else
    words_that_phrase_begins_with(phrase).each do |word|
      remainder = phrase[word.size..-1]
      phrase_to_words(remainder, words + [word], word_list)
    end
  end
  word_list
end

p phrase_to_words('expertsexchange')
# => [["experts", "exchange"], ["expert", "sex", "change"]]

If given a phrase that has any unrecognized words, it returns an empty array:

p phrase_to_words('expertsfoo')
# => []

If the word list is long, this will be slow. You can make this algorithm faster by preprocessing the word list into a tree. The preprocessing itself will take time, so whether it's worth it will depend upon how many domains you want to test.

Here's some code to turn the word list into a tree:

def add_word_to_tree(tree, word)
  first_letter = word[0..0].to_sym
  remainder = word[1..-1]
  tree[first_letter] ||= {}
  if remainder.empty?
    tree[first_letter][:word] = true
  else
    add_word_to_tree(tree[first_letter], remainder)
  end
end

def make_word_tree
  root = {}
  WORD_LIST.each do |word|
    add_word_to_tree(root, word)
  end
  root
end

def word_tree
  @word_tree ||= make_word_tree
end

This produces a tree that looks like this:

{:c=>{:h=>{:a=>{:n=>{:g=>{:e=>{:word=>true}}}}}}, :s=>{:e=>{:x=>{:word=>true}}}, :e=>{:x=>{:c=>{:h=>{:a=>{:n=>{:g=>{:e=>{:word=>true}}}}}}, :p=>{:e=>{:r=>{:t=>{:word=>true, :s=>{:word=>true}}}}}}}}

It looks like Lisp, doesn't it? Each node in the tree is a hash. Each hash key is either a letter, with the value being another node, or it is the symbol :word with the value being true. Nodes with :word are words.

Modifying words_that_phrase_begins_with to use the new tree structure will make it faster:

def words_that_phrase_begins_with(phrase)
  node = word_tree
  words = []
  phrase.each_char.with_index do |c, i|
    node = node[c.to_sym]
    break if node.nil?
    words << phrase[0..i] if node[:word]
  end
  words
end
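
To see that the tree-backed lookup gives the same segmentations, here is the whole answer assembled into one runnable sketch (same `WORD_LIST` as above). Note that the results come out in a different order than with the linear scan of `WORD_LIST`, because the tree walk finds shorter prefixes first:

```ruby
WORD_LIST = %w[experts expert exchange sex change]

def add_word_to_tree(tree, word)
  first_letter = word[0..0].to_sym
  remainder = word[1..-1]
  tree[first_letter] ||= {}
  if remainder.empty?
    tree[first_letter][:word] = true
  else
    add_word_to_tree(tree[first_letter], remainder)
  end
end

def make_word_tree
  root = {}
  WORD_LIST.each { |word| add_word_to_tree(root, word) }
  root
end

def word_tree
  @word_tree ||= make_word_tree
end

# Tree-backed lookup: walk the tree one character at a time,
# collecting every prefix that ends at a :word node
def words_that_phrase_begins_with(phrase)
  node = word_tree
  words = []
  phrase.each_char.with_index do |c, i|
    node = node[c.to_sym]
    break if node.nil?
    words << phrase[0..i] if node[:word]
  end
  words
end

def phrase_to_words(phrase, words = [], word_list = [])
  if phrase.empty?
    word_list << words
  else
    words_that_phrase_begins_with(phrase).each do |word|
      phrase_to_words(phrase[word.size..-1], words + [word], word_list)
    end
  end
  word_list
end

p phrase_to_words('expertsexchange')
# => [["expert", "sex", "change"], ["experts", "exchange"]]
```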

许一世地老天荒 2024-10-12 08:55:59


I don't know of a gem for this, but if I had to solve this problem, I would download an English word dictionary and read about text-searching algorithms.

When there is more than one way to divide the letters (as in sepp2k's expertsexchange), two hints can help:

  1. Your dictionary is sorted by, for example, the popularity of each word, so divisions made of the most popular words are more valuable.
  2. You can go to the home page of the site whose domain you are analyzing and read its content, searching for your candidate words. I don't think you'll find sex on a page for some experts. But... hm... experts can be so different ,.)
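
The first hint can be sketched as a simple scoring pass. Assuming a word list ordered most-popular-first (the `POPULARITY` list here is a made-up stand-in; a real one could come from any frequency-sorted dictionary), each candidate segmentation is ranked by the positions of its words:

```ruby
# Hypothetical frequency-sorted word list, most popular first
POPULARITY = %w[change sex expert experts exchange]

# Lower score = made of more popular words; summing the ranks is one
# simple way to compare otherwise equally plausible segmentations
def popularity_score(segmentation)
  segmentation.sum { |word| POPULARITY.index(word) || POPULARITY.size }
end

candidates = [%w[experts exchange], %w[expert sex change]]
best = candidates.min_by { |seg| popularity_score(seg) }
p best  # => ["expert", "sex", "change"] under this toy ordering
```

With a real frequency list the winner would likely flip; the point is only that rank order gives a tiebreaker between segmentations.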
冷了相思 2024-10-12 08:55:59

Update


I've been working with this challenge and came up with the following code.
Please refactor if I'm doing something wrong :-)

Benchmark:


Runtime: 11 sec.
f-file: 13,000 lines of domain names
w-file: 2,000 words (to check against)

Code:


# Read and chomp both files; readlines keeps trailing newlines,
# which would make the include? check below never match
lines = File.readlines('resource/domainlist.txt').map(&:chomp)
words = File.readlines('resource/commonwords.txt').map(&:chomp)

results  = {}

lines.each do |line|
  # Start with words from 2 letters on, so ignoring 1 letter words like 'a'
  word_size = 2
  # Only get the .com domains
  if line =~ /^.*,[a-z]+\.com.*$/i then
    # Strip the .com off the domain
    line.gsub!(/^.*,([a-z]+)\.com.*$/i, '\\1')
    # If the domain name is between 4 and 14 characters
    if line.size > 3 and line.size < 15 then
      # For each word size up to the length of the string ...
      line.size.times do
        # Set the counter
        i = 0
        # As long as we're within the length of the string
        while i <= line.size - word_size do
          # Get the word in proper DRY fashion
          word = line[i,word_size]
          # Check the word against our list
          if words.include?(word) 
            results[line] = [] unless results[line]
            # Add all the found words to the hash
            results[line] << word
          end
          i += 1
        end
        word_size += 1
      end
    end
  end
end
p results
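
Since the answer invites refactoring, here is one possible cleanup sketch: chomped input, a `Set` for O(1) word lookup, and a single substring scan per domain. The input arrays stand in for the resource files used above:

```ruby
require 'set'

# In-memory stand-ins for resource/domainlist.txt and
# resource/commonwords.txt from the code above
lines = ['1,whatwomenwant.com,x', '2,ab.com,x']
words = Set.new(%w[what women want wha ant])

results = Hash.new { |h, k| h[k] = [] }

lines.each do |line|
  # Only .com entries; capture just the name part of the domain
  next unless line =~ /,([a-z]+)\.com/i
  name = $1.downcase
  # Same length filter as above: 4 to 14 characters
  next unless (4..14).cover?(name.size)
  # Collect every substring of length >= 2 that is a known word
  (0...name.size).each do |start|
    (2..name.size - start).each do |len|
      word = name[start, len]
      results[name] << word if words.include?(word)
    end
  end
end

p results
# => {"whatwomenwant"=>["wha", "what", "women", "want", "ant"]}
```

The `Set` makes the per-substring check constant time instead of scanning the whole word array, which is where most of the 11 seconds above is likely spent.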