Extracting individual existing words from a domain name
I'm looking for a Ruby gem (preferably) that will cut domain names up into their words.
whatwomenwant.com => 3 words, "what", "women", "want".
If it can ignore things like numbers and gibberish then great.
Comments (3)
You'll need a word list, such as those produced by Project Gutenberg or available in the source for ispell, etc. Then you can use the following code to decompose a domain into words:
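A minimal sketch of this word-list approach, under a few assumptions: the tiny WORDS array stands in for a real dictionary file, the TLD has already been stripped from the domain, and phrase_to_words is a hypothetical name chosen for illustration. It tries every dictionary word that the phrase starts with and recurses on the remainder, so it returns every possible split.

```ruby
# Stand-in word list (an assumption); in practice, load one from a dictionary file.
WORDS = %w[change sex exchange expert experts what women want].freeze

# All dictionary words that +phrase+ starts with (linear scan of the list).
def words_that_phrase_begins_with(phrase)
  WORDS.select { |word| phrase.start_with?(word) }
end

# Every way of splitting +phrase+ into dictionary words.
# Returns an array of word arrays; it is empty when no complete split exists.
def phrase_to_words(phrase)
  return [[]] if phrase.empty?

  words_that_phrase_begins_with(phrase).flat_map do |word|
    rest_of_phrase = phrase[word.length..-1]
    phrase_to_words(rest_of_phrase).map { |rest| [word, *rest] }
  end
end

p phrase_to_words('whatwomenwant')
# => [["what", "women", "want"]]
p phrase_to_words('expertsexchange')
# => [["expert", "sex", "change"], ["experts", "exchange"]]
p phrase_to_words('expertsfoo')
# => []
```

Because each call scans the whole word list, the cost grows with the size of the list.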
If given a phrase that contains any unrecognized words, it returns an empty array.
If the word list is long, this will be slow. You can make this algorithm faster by preprocessing the word list into a tree. The preprocessing itself will take time, so whether it's worth it will depend upon how many domains you want to test.
Here's some code to turn the word list into a tree:
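One way such a conversion could look, as a sketch: each word is inserted letter by letter into nested hashes, and the node reached at the end of a word is marked with :word. The function name words_to_tree and the five sample words are assumptions made for illustration.

```ruby
# Build a nested-hash tree from a list of words.
# Each letter becomes a symbol key; a node that ends a word gets :word => true.
def words_to_tree(words)
  tree = {}
  words.each do |word|
    node = tree
    word.each_char do |char|
      # Descend into the child for this letter, creating it if needed.
      node = (node[char.to_sym] ||= {})
    end
    node[:word] = true
  end
  tree
end

p words_to_tree(%w[change sex exchange expert experts])
```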
This produces a tree that looks like this:
{:c=>{:h=>{:a=>{:n=>{:g=>{:e=>{:word=>true}}}}}}, :s=>{:e=>{:x=>{:word=>true}}}, :e=>{:x=>{:c=>{:h=>{:a=>{:n=>{:g=>{:e=>{:word=>true}}}}}}, :p=>{:e=>{:r=>{:t=>{:word=>true, :s=>{:word=>true}}}}}}}}
It looks like Lisp, doesn't it? Each node in the tree is a hash. Each hash key is either a letter, with the value being another node, or it is the symbol :word with the value being true. Nodes with :word are words.
Modifying words_that_phrase_begins_with to use the new tree structure will make it faster:
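A sketch of what a tree-based lookup could look like, assuming the nested-hash layout described above (letter symbols as keys, :word marking the end of a word). The helper names words_to_tree and phrase_to_words and the sample words are hypothetical. Instead of scanning the whole word list, the lookup walks the tree one letter at a time and stops as soon as no branch matches.

```ruby
# Build the nested-hash tree (hypothetical helper, included to keep this runnable).
def words_to_tree(words)
  words.each_with_object({}) do |word, tree|
    node = word.each_char.reduce(tree) { |n, c| n[c.to_sym] ||= {} }
    node[:word] = true
  end
end

TREE = words_to_tree(%w[change sex exchange expert experts])

# Walk the tree along the letters of +phrase+, collecting every
# dictionary word that is a prefix of it.
def words_that_phrase_begins_with(phrase, tree = TREE)
  words = []
  node = tree
  phrase.each_char.with_index do |char, index|
    node = node[char.to_sym]
    break if node.nil?                        # no word continues with this letter
    words << phrase[0..index] if node[:word]  # the prefix read so far is a word
  end
  words
end

# Every way of splitting +phrase+ into dictionary words (empty if none).
def phrase_to_words(phrase)
  return [[]] if phrase.empty?

  words_that_phrase_begins_with(phrase).flat_map do |word|
    phrase_to_words(phrase[word.length..-1]).map { |rest| [word, *rest] }
  end
end

p words_that_phrase_begins_with('expertsexchange')
# => ["expert", "experts"]
p phrase_to_words('expertsexchange')
# => [["expert", "sex", "change"], ["experts", "exchange"]]
```

Each lookup now costs at most one hash access per letter of the phrase, regardless of how many words the list contains.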
I don't know of any gems for this, but if I had to solve this problem, I would download an English word dictionary and read about text-searching algorithms.
When there is more than one way to divide the letters (as in sepp2k's expertsexchange), you can use two hints:
Update
I've been working with this challenge and came up with the following code.
Please refactor if I'm doing something wrong :-)
Benchmark:
Runtime: 11 seconds.
f-file: 13,000 lines of domain names
w-file: 2,000 words (to check against)
Code: