How is a dictionary for a topic-focused crawler defined?
I am wondering what the best method is to define a dictionary for calculating the relevance of a specific website. Dictionaries of words seem to be an important way of measuring the relevance of new websites found via links (e.g. if a website is linked to but does not contain any words about soccer, it is probably irrelevant for my soccer crawler).
I came to the following ideas, but all of them have major drawbacks:
- Write a dictionary by hand -> you might forget a lot of words, and it is very time-consuming
- Take the most important words from the first website as the dictionary -> a lot of words would probably be missing
- Take the most important words across all websites as dictionary entries and weight them by each site's relevance (e.g. a website with a relevance of only 0.4 would not influence the dictionary as much as one with a relevance of 0.8) -> seems pretty complicated and could lead to unexpected results
The last method seems the best to me, but maybe there are better and more common methods? A rough sketch of the weighted-dictionary idea follows.
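Here is a minimal Python sketch of what I mean; the `(text, relevance)` input format and the `build_weighted_dictionary` name are just assumptions for the example:

```python
from collections import Counter
import re

def build_weighted_dictionary(pages):
    """pages: iterable of (text, relevance) pairs, relevance in [0, 1]."""
    weights = Counter()
    for text, relevance in pages:
        words = re.findall(r"[a-z']+", text.lower())
        counts = Counter(words)
        total = sum(counts.values())
        for word, count in counts.items():
            # Each page contributes its term frequencies scaled by its
            # relevance, so a 0.4-relevant page counts half as much as
            # a 0.8-relevant one.
            weights[word] += relevance * count / total
    return weights

seed_pages = [
    ("The striker scored a goal in the final minute of the match", 0.8),
    ("Ticket prices for the stadium tour were announced today", 0.4),
]
dictionary = build_weighted_dictionary(seed_pages)
print(dictionary.most_common(5))
```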
1 Answer
I would recommend that you build a common-word dictionary from a list of known sites. Suppose you have 100 sites and you know that they are all about soccer. You can build unigram and bigram (or n-gram) maps of their content and use them as a baseline against which you measure some kind of "deviation" for every new page you observe. Note that you would have to remove common stopwords in order to eliminate irrelevant words; English has quite a few, and here is a list: http://www.ranks.nl/resources/stopwords.html
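As a rough sketch of the preprocessing step (the tiny `STOPWORDS` set here is just a stand-in for a full list like the one linked above):

```python
import re

# Stand-in stopword list; in practice use a complete list.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "for", "to", "is", "it"}

def tokenize(text):
    """Lowercase, split into words, and drop common stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]

print(tokenize("The goalkeeper saved a penalty in the final."))
# ['goalkeeper', 'saved', 'penalty', 'final']
```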
N-grams are frequency counts of words or combinations of words. A unigram map uses each word as the key and its number of occurrences as the value. Bigrams are usually built by combining two consecutive words into the key, and likewise for trigrams and higher-order n-grams.
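A minimal sketch of those maps, reusing `tokenize()` from the previous snippet:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count n-grams: keys are tuples of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = tokenize("Soccer fans love soccer because soccer is exciting")
unigrams = ngram_counts(tokens, 1)   # e.g. ('soccer',) -> 3
bigrams = ngram_counts(tokens, 2)    # e.g. ('fans', 'love') -> 1
```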
You can take the top n-grams from your known sites and compare them against the top n-grams of the site you are currently evaluating. The more similar they are, the more likely the site covers the same topic.
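And a sketch of the comparison itself, reusing the helpers above; no particular similarity metric is prescribed here, so the Jaccard overlap of the top-k n-grams below is just one simple choice:

```python
def top_ngram_similarity(baseline_counts, candidate_counts, k=50):
    """Jaccard similarity between the k most frequent n-grams of each side."""
    top_base = {g for g, _ in baseline_counts.most_common(k)}
    top_cand = {g for g, _ in candidate_counts.most_common(k)}
    if not top_base or not top_cand:
        return 0.0
    return len(top_base & top_cand) / len(top_base | top_cand)

# A new page is considered on-topic if its score clears some threshold.
score = top_ngram_similarity(unigrams, ngram_counts(tokenize("A soccer match report"), 1))
if score > 0.1:  # threshold is an assumption; tune it on held-out pages
    print("likely on-topic:", score)
```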