Extracting the individual real words in a domain name

Posted 2024-10-05 08:55:59

I'm looking for a Ruby gem (preferably) that will cut domain names up into their words.

whatwomenwant.com => 3 words, "what", "women", "want".

If it can ignore things like numbers and gibberish then great.


3 Answers

你不是我要的菜∠ 2024-10-12 08:55:59

You'll need a word list such as those produced by Project Gutenberg or available in the source for ispell &c. Then you can use the following code to decompose a domain into words:

WORD_LIST = [
  'experts',
  'expert',
  'exchange',
  'sex',
  'change',
]

def words_that_phrase_begins_with(phrase)
  WORD_LIST.find_all do |word|
    phrase.start_with?(word)
  end
end

def phrase_to_words(phrase, words = [], word_list = [])
  if phrase.empty?
    word_list << words
  else
    words_that_phrase_begins_with(phrase).each do |word|
      remainder = phrase[word.size..-1]
      phrase_to_words(remainder, words + [word], word_list)
    end
  end
  word_list
end

p phrase_to_words('expertsexchange')
# => [["experts", "exchange"], ["expert", "sex", "change"]]

If given a phrase that has any unrecognized words, it returns an empty array:

p phrase_to_words('expertsfoo')
# => []

If the word list is long, this will be slow. You can make this algorithm faster by preprocessing the word list into a tree. The preprocessing itself will take time, so whether it's worth it will depend upon how many domains you want to test.

Here's some code to turn the word list into a tree:

def add_word_to_tree(tree, word)
  first_letter = word[0..0].to_sym
  remainder = word[1..-1]
  tree[first_letter] ||= {}
  if remainder.empty?
    tree[first_letter][:word] = true
  else
    add_word_to_tree(tree[first_letter], remainder)
  end
end

def make_word_tree
  root = {}
  WORD_LIST.each do |word|
    add_word_to_tree(root, word)
  end
  root
end

def word_tree
  @word_tree ||= make_word_tree
end

This produces a tree that looks like this:

{:c=>{:h=>{:a=>{:n=>{:g=>{:e=>{:word=>true}}}}}}, :s=>{:e=>{:x=>{:word=>true}}}, :e=>{:x=>{:c=>{:h=>{:a=>{:n=>{:g=>{:e=>{:word=>true}}}}}}, :p=>{:e=>{:r=>{:t=>{:word=>true, :s=>{:word=>true}}}}}}}}

It looks like Lisp, doesn't it? Each node in the tree is a hash. Each hash key is either a letter, with the value being another node, or it is the symbol :word with the value being true. Nodes with :word are words.

Modifying words_that_phrase_begins_with to use the new tree structure will make it faster:

def words_that_phrase_begins_with(phrase)
  node = word_tree
  words = []
  phrase.each_char.with_index do |c, i|
    node = node[c.to_sym]
    break if node.nil?
    words << phrase[0..i] if node[:word]
  end
  words
end
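
To see that the tree-backed lookup gives the same segmentations, here is the whole answer assembled into one runnable sketch (same `WORD_LIST` as above). Note that the results come out in a different order than with the linear scan of `WORD_LIST`, because the tree walk finds shorter prefixes first:

```ruby
WORD_LIST = %w[experts expert exchange sex change]

def add_word_to_tree(tree, word)
  first_letter = word[0..0].to_sym
  remainder = word[1..-1]
  tree[first_letter] ||= {}
  if remainder.empty?
    tree[first_letter][:word] = true
  else
    add_word_to_tree(tree[first_letter], remainder)
  end
end

def make_word_tree
  root = {}
  WORD_LIST.each { |word| add_word_to_tree(root, word) }
  root
end

def word_tree
  @word_tree ||= make_word_tree
end

# Tree-backed lookup: walk the tree one character at a time,
# collecting every prefix that ends at a :word node
def words_that_phrase_begins_with(phrase)
  node = word_tree
  words = []
  phrase.each_char.with_index do |c, i|
    node = node[c.to_sym]
    break if node.nil?
    words << phrase[0..i] if node[:word]
  end
  words
end

def phrase_to_words(phrase, words = [], word_list = [])
  if phrase.empty?
    word_list << words
  else
    words_that_phrase_begins_with(phrase).each do |word|
      phrase_to_words(phrase[word.size..-1], words + [word], word_list)
    end
  end
  word_list
end

p phrase_to_words('expertsexchange')
# => [["expert", "sex", "change"], ["experts", "exchange"]]
```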

许一世地老天荒 2024-10-12 08:55:59


I don't know of a gem for this, but if I had to solve this problem, I would download an English word dictionary and read about text-searching algorithms.

When there is more than one way to divide the letters (as in sepp2k's expertsexchange), two hints can help:

  1. Your dictionary is sorted by, for example, the popularity of each word, so divisions made of the most popular words are more valuable.
  2. You can go to the home page of the site whose domain you are analyzing and read its content, searching for your candidate words. I don't think you'll find sex on a page for some experts. But... hm... experts can be so different ,.)
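
The first hint can be sketched as a simple scoring pass. Assuming a word list ordered most-popular-first (the `POPULARITY` list here is a made-up stand-in; a real one could come from any frequency-sorted dictionary), each candidate segmentation is ranked by the positions of its words:

```ruby
# Hypothetical frequency-sorted word list, most popular first
POPULARITY = %w[change sex expert experts exchange]

# Lower score = made of more popular words; summing the ranks is one
# simple way to compare otherwise equally plausible segmentations
def popularity_score(segmentation)
  segmentation.sum { |word| POPULARITY.index(word) || POPULARITY.size }
end

candidates = [%w[experts exchange], %w[expert sex change]]
best = candidates.min_by { |seg| popularity_score(seg) }
p best  # => ["expert", "sex", "change"] under this toy ordering
```

With a real frequency list the winner would likely flip; the point is only that rank order gives a tiebreaker between segmentations.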
冷了相思 2024-10-12 08:55:59

Update


I've been working with this challenge and came up with the following code.
Please refactor if I'm doing something wrong :-)

Benchmark:


Runtime: 11 sec.
f-file: 13,000 lines of domain names
w-file: 2,000 words (to check against)

Code:


# Read and chomp both files; readlines keeps trailing newlines,
# which would make the include? check below never match
lines = File.readlines('resource/domainlist.txt').map(&:chomp)
words = File.readlines('resource/commonwords.txt').map(&:chomp)

results  = {}

lines.each do |line|
  # Start with words from 2 letters on, so ignoring 1 letter words like 'a'
  word_size = 2
  # Only get the .com domains
  if line =~ /^.*,[a-z]+\.com.*$/i then
    # Strip the .com off the domain
    line.gsub!(/^.*,([a-z]+)\.com.*$/i, '\\1')
    # If the domain name is between 4 and 14 characters
    if line.size > 3 and line.size < 15 then
      # For each word size up to the length of the string ...
      line.size.times do
        # Set the counter
        i = 0
        # As long as we're within the length of the string
        while i <= line.size - word_size do
          # Get the word in proper DRY fashion
          word = line[i,word_size]
          # Check the word against our list
          if words.include?(word) 
            results[line] = [] unless results[line]
            # Add all the found words to the hash
            results[line] << word
          end
          i += 1
        end
        word_size += 1
      end
    end
  end
end
p results
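
Since the answer invites refactoring, here is one possible cleanup sketch: chomped input, a `Set` for O(1) word lookup, and a single substring scan per domain. The input arrays stand in for the resource files used above:

```ruby
require 'set'

# In-memory stand-ins for resource/domainlist.txt and
# resource/commonwords.txt from the code above
lines = ['1,whatwomenwant.com,x', '2,ab.com,x']
words = Set.new(%w[what women want wha ant])

results = Hash.new { |h, k| h[k] = [] }

lines.each do |line|
  # Only .com entries; capture just the name part of the domain
  next unless line =~ /,([a-z]+)\.com/i
  name = $1.downcase
  # Same length filter as above: 4 to 14 characters
  next unless (4..14).cover?(name.size)
  # Collect every substring of length >= 2 that is a known word
  (0...name.size).each do |start|
    (2..name.size - start).each do |len|
      word = name[start, len]
      results[name] << word if words.include?(word)
    end
  end
end

p results
# => {"whatwomenwant"=>["wha", "what", "women", "want", "ant"]}
```

The `Set` makes the per-substring check constant time instead of scanning the whole word array, which is where most of the 11 seconds above is likely spent.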