Programmatically extract keywords from domain names
Let's say I have a list of domain names that I would like to analyze. Unless the domain name is hyphenated, I don't see a particularly easy way to "extract" the keywords used in the domain. Yet I see it done on sites such as DomainTools.com, Estibot.com, etc. For example:
ilikecheese.com becomes "i like cheese"
sanfranciscohotels.com becomes "san francisco hotels"
...
Any suggestions for accomplishing this efficiently and effectively?
Edit: I'd like to write this in PHP.
Comments (7)
OK, I ran the script I wrote for this SO question, with a couple of minor changes: it now uses log probabilities to avoid underflow, and it reads multiple files as the corpus.
For my corpus I downloaded a bunch of files from Project Gutenberg. There was no real method to this; I just grabbed all English-language files from etext00, etext01, and etext02.
Below are the results; I saved the top three for each combination.
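The script itself is not shown in this post. As a rough idea of what log-probability scoring over a multi-file corpus can look like, here is a minimal Python sketch (not the author's script; the logic should port to PHP without much trouble). The function names `build_log_probs` and `top_segmentations` are made up for illustration, and the sketch assumes a plain unigram model with no smoothing, so a domain containing a word that never appears in the corpus simply gets no segmentation.

```python
import math
import re
from collections import Counter
from functools import lru_cache


def build_log_probs(corpus_texts):
    """Count words across several corpus texts and return log-probabilities."""
    counts = Counter()
    for text in corpus_texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    total = sum(counts.values())
    # Summing logs later avoids the underflow that multiplying many
    # small probabilities would cause.
    return {word: math.log(n / total) for word, n in counts.items()}


def top_segmentations(label, log_probs, k=3, max_word_len=20):
    """Return up to k (score, words) splits of label, best score first."""

    @lru_cache(maxsize=None)
    def best(start):
        # k best ways to split label[start:] into known words.
        if start == len(label):
            return ((0.0, ()),)
        candidates = []
        last_end = min(len(label), start + max_word_len)
        for end in range(start + 1, last_end + 1):
            word = label[start:end]
            if word not in log_probs:
                continue
            for score, rest in best(end):
                candidates.append((log_probs[word] + score, (word,) + rest))
        candidates.sort(key=lambda c: c[0], reverse=True)
        return tuple(candidates[:k])

    return [(score, list(words)) for score, words in best(0)]
```

Usage would be along the lines of `top_segmentations("ilikecheese", build_log_probs(texts))`, where `texts` is a list of strings read from your corpus files and the domain has already had its TLD stripped.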
Might want to check out this SO question.
You need to develop a heuristic that will get likely matches out of the domain. The way I would do it is first find a large corpus of text. For example, you could download Wikipedia.
Next take your corpus, and combine every two adjacent words. For example, if your sentence is:
You'll create a list:
Each of these would have a count of one. As you parse your corpus, you'll keep track of the frequency of each pair of adjacent words. Additionally, for each pair, you'll need to store what the original two words were.
Sort this list by frequency, and then attempt to find matches in your domain based on these words.
Lastly, do a domain check for the top two word phrases which aren't registered!
I think sites like DomainTools take a list of the highest-ranking words and try to parse those words out first. Depending on the purpose, you may want to consider using MTurk to do the job. Different people will parse the same words differently, and might not do so in proportion to how common the words are.
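To make the pair-counting idea a bit more concrete, here is a small Python sketch (easy enough to redo in PHP). It is only an illustration under some assumptions: `ranked_pairs` and `pairs_in_label` are hypothetical names, the corpus is passed in as one string, and sentence boundaries are ignored, so the last word of one sentence gets paired with the first word of the next.

```python
import re
from collections import Counter


def ranked_pairs(text):
    """Return [((first, second), count), ...] sorted by descending count."""
    words = re.findall(r"[a-z]+", text.lower())
    # zip(words, words[1:]) walks over every two adjacent words; keeping the
    # tuple as the key means the original two words are stored with the count.
    return Counter(zip(words, words[1:])).most_common()


def pairs_in_label(label, ranked):
    """List pairs whose concatenation occurs inside label, most frequent first."""
    return [(pair, n) for pair, n in ranked if "".join(pair) in label]
```

For a real corpus such as a Wikipedia dump you would feed the text in chunks and merge the counters rather than hold everything in one string.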
choosespain.com
kidsexpress.com
childrenswear.com
dicksonweb.com
Have fun (and a good lawyer) if you are going to try to parse the url with a dictionary.
You might do better if you can find the same characters but separated by white space on their web site.
Other possibilities: extract data from the SSL certificate; query the top-level domain name server;
access the domain name server (TLD); or use one of the "whois" tools or services (just google "whois").
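The idea of looking for the same characters separated by white space on the site could be sketched like this in Python. It assumes you have already fetched the page and stripped the HTML down to plain text (that part is not shown), and `find_spaced_form` is just an illustrative name, not an existing library function.

```python
import re


def find_spaced_form(label, page_text):
    """Return the first run of consecutive page words that joins to label, or None."""
    words = re.findall(r"[a-z0-9]+", page_text.lower())
    for start in range(len(words)):
        joined = ""
        for end in range(start, len(words)):
            joined += words[end]
            if joined == label:
                return words[start:end + 1]
            # Stop extending this run as soon as it can no longer match.
            if not label.startswith(joined):
                break
    return None
```

Note that if the page only ever writes the name as a single unspaced word, this returns that one word, which tells you nothing new.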
If you have a list of valid words, you can loop through your domain string and try to cut off a valid word each time, using a backtracking algorithm. If you manage to use up the whole string, you are finished. Be aware that the time complexity of this is not optimal :)
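A minimal Python sketch of that backtracking approach, assuming the word list is available as a set (`split_into_words` is a made-up name, and trying longer prefixes first is my own choice, not something the approach requires):

```python
def split_into_words(label, valid_words):
    """Return one list of valid words that exactly covers label, or None."""
    if not label:
        return []  # the whole string has been used up
    # Try to cut off a valid word from the front, longest candidates first.
    for end in range(len(label), 0, -1):
        prefix = label[:end]
        if prefix in valid_words:
            rest = split_into_words(label[end:], valid_words)
            if rest is not None:
                return [prefix] + rest
            # Otherwise backtrack and try a shorter prefix.
    return None
```

Without memoization this can take exponential time on unlucky inputs, and it returns only the first split it finds, so `choosespain` may well come back as `chooses` + `pain` if both are in the list.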
As a simple start, try pspell. You might want to compare results, see whether you also got the stem of a word without the "s" at the end, and merge them.
You would have to run a dictionary engine against the domain entry to find valid words, and then run that dictionary engine against the result to ensure that every piece is a valid word.